
Postgraduate thesis: Multimodal vision-language representation learning

Title: Multimodal vision-language representation learning
Authors: Ge Yuying (葛玉莹)
Advisors: Luo, P; Wong, KKY
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Ge Yuying, [葛玉莹]. (2023). Multimodal vision-language representation learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Vision-language representation learning aims to encode general-purpose representations of videos and texts that transfer well to diverse downstream tasks, through exploiting large-scale Internet data. The learned representations should possess the capability of encoding both visual and textual information, as well as reasoning about the relationships between them. This thesis investigates vision-language representation learning for various applications including (i) multimodal video-text tasks; (ii) core computer vision tasks; (iii) robotic manipulation tasks. I first propose novel methods of pre-training a model to learn transferable video-text representations for downstream retrieval, which aims to promote local feature learning while maintaining high efficiency. Specifically, I leverage the rich semantics of texts (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional spatial content and temporal dynamics. Additionally, I explore masked visual modeling with injected language semantics in video-text pre-training, which strengthens both the awareness of local visual features and the fine-grained cross-modality alignment. I further exploit language semantics to enhance spatiotemporal video representations for downstream action recognition. As video data is naturally multimodal with transcribed speech knowledge in the form of automatic speech recognition (ASR) transcripts, I use the time-dependent ASR transcripts to regularize the model to learn transferable video representations for spatial and temporal reasoning. Moreover, I utilize the generalization ability of vision-language pre-trained models to build a robot that can adapt to unseen tasks and environments. When a trained policy is deployed in a new task or environment, the feedback from pre-trained models is used to fine-tune the policy in an automatic way. Finally, I discuss some future work towards vision-language representation learning for cognitive-level intelligent robots.
Degree: Doctor of Philosophy
Subjects: Computer vision; Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/330278

 

DC Field: Value
dc.contributor.advisor: Luo, P
dc.contributor.advisor: Wong, KKY
dc.contributor.author: Ge Yuying
dc.contributor.author: 葛玉莹
dc.date.accessioned: 2023-08-31T09:18:27Z
dc.date.available: 2023-08-31T09:18:27Z
dc.date.issued: 2023
dc.identifier.citation: Ge Yuying, [葛玉莹]. (2023). Multimodal vision-language representation learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/330278
dc.description.abstract: Vision-language representation learning aims to encode general-purpose representations of videos and texts that transfer well to diverse downstream tasks, through exploiting large-scale Internet data. The learned representations should possess the capability of encoding both visual and textual information, as well as reasoning about the relationships between them. This thesis investigates vision-language representation learning for various applications including (i) multimodal video-text tasks; (ii) core computer vision tasks; (iii) robotic manipulation tasks. I first propose novel methods of pre-training a model to learn transferable video-text representations for downstream retrieval, which aims to promote local feature learning while maintaining high efficiency. Specifically, I leverage the rich semantics of texts (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional spatial content and temporal dynamics. Additionally, I explore masked visual modeling with injected language semantics in video-text pre-training, which strengthens both the awareness of local visual features and the fine-grained cross-modality alignment. I further exploit language semantics to enhance spatiotemporal video representations for downstream action recognition. As video data is naturally multimodal with transcribed speech knowledge in the form of automatic speech recognition (ASR) transcripts, I use the time-dependent ASR transcripts to regularize the model to learn transferable video representations for spatial and temporal reasoning. Moreover, I utilize the generalization ability of vision-language pre-trained models to build a robot that can adapt to unseen tasks and environments. When a trained policy is deployed in a new task or environment, the feedback from pre-trained models is used to fine-tune the policy in an automatic way. Finally, I discuss some future work towards vision-language representation learning for cognitive-level intelligent robots.
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Machine learning
dc.title: Multimodal vision-language representation learning
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2023
dc.identifier.mmsid: 991044717470103414

Export: via the OAI-PMH interface in XML formats, or to other non-XML formats.
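
Records like this one can typically be harvested programmatically with a standard OAI-PMH GetRecord request in the oai_dc (Dublin Core) format, which returns the dc.* fields listed above as XML. The sketch below is a minimal illustration only: the repository base URL and the OAI identifier format are assumptions inferred from the handle 10722/330278, not values confirmed by this page.

```python
# Minimal sketch of an OAI-PMH GetRecord request for this record.
# Assumptions (not confirmed by this page): the OAI-PMH endpoint URL and the
# OAI identifier format; both are guesses based on the handle 10722/330278.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://hub.hku.hk/oai/request"   # assumed OAI-PMH endpoint
OAI_ID = "oai:hub.hku.hk:10722/330278"        # assumed identifier format

params = {
    "verb": "GetRecord",          # standard OAI-PMH verb
    "metadataPrefix": "oai_dc",   # Dublin Core, matching the dc.* fields above
    "identifier": OAI_ID,
}

# Fetch the XML envelope containing the Dublin Core record.
with urlopen(f"{BASE_URL}?{urlencode(params)}") as resp:
    xml = resp.read().decode("utf-8")

print(xml[:500])  # print the start of the response for inspection
```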