Learning vision-language representation for multimodal understanding

Wang, Teng; 王腾

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Learning vision-language representation for multimodal understanding

Title	Learning vision-language representation for multimodal understanding
Authors	Wang, Teng 王腾
Issue Date	2024
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Wang, T. [王腾]. (2024). Learning vision-language representation for multimodal understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Humans comprehend and interact with their surroundings by integrating multi-sensory information, including visual, linguistic, and auditory cues. The field of vision-language representation learning is dedicated to enabling machines to learn multimodal associations and interactions between visual and textual data. This thesis tackles three pivotal problems: the scalability of pretraining data, the efficiency of pretraining objectives, and fine-grained vision-language alignment. Regarding data scalability, we focus on vision-language representation learning that leverages unpaired images and texts. To enhance the implicit alignments between modalities and augment data diversity, we introduce cross-modal cutmix, a technique that blends visual patches with sentences to create multimodal sentences, i.e., multimodal views of a sentence. By incorporating diverse multimodal sentences into contrastive learning, the model exploits instance-level alignments between textual and multimodal samples. Our model circumvents the constraints of paired datasets, facilitating scalable multimodal representation learning with a broader and more varied collection of unpaired data. In terms of learning efficiency, we investigate how to accelerate vision-language pretraining. We empirically find that an essential obstacle to training efficiency is the entanglement of the prediction rate (percentage of tokens for reconstruction) and the corruption rate (percentage of corrupted tokens) in masked language modeling: a proper corruption rate is achieved at the cost of excluding a large portion of output tokens from the prediction loss. To overcome this limitation, we propose free language modeling (FLM), a new pretraining objective that decouples the prediction rate from the corruption rate. FLM achieves faster convergence by allowing customization of corruption spans for each token, while maintaining competitive performance on downstream vision-language tasks. Concerning cross-modal alignment granularity, we study fine-grained alignments between untrimmed videos and natural language. We propose a grounded vision-language learning (GVL) framework for untrimmed videos, focusing on detecting informative events and aligning multi-sentence descriptions with corresponding event segments. We introduce a parallel decoding paradigm for dense video captioning (PDVC) to segment videos effectively, enhancing the coherence and readability of generated dense captions. Furthermore, two dual pretext tasks are proposed to encourage fine-grained segment-level alignments: text-to-event contrast and event-to-text generation. The framework is versatile and applicable to visually-grounded language understanding and generation tasks. We conduct extensive experiments to validate the proposed methods. These efforts advance the frontiers of multimodal learning and pave the way for more efficient and effective integration of vision and language in machine intelligence systems.
Degree	Doctor of Philosophy
Subject	Computer vision; Machine learning
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/350289

DC Field	Value	Language
dc.contributor.author	Wang, Teng	-
dc.contributor.author	王腾	-
dc.date.accessioned	2024-10-23T09:45:56Z	-
dc.date.available	2024-10-23T09:45:56Z	-
dc.date.issued	2024	-
dc.identifier.citation	Wang, T. [王腾]. (2024). Learning vision-language representation for multimodal understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/350289	-
dc.description.abstract	Humans comprehend and interact with their surroundings by integrating multi-sensory information, including visual, linguistic, and auditory cues. The field of vision-language representation learning is dedicated to enabling machines to learn multimodal associations and interactions between visual and textual data. This thesis tackles three pivotal problems: the scalability of pretraining data, the efficiency of pretraining objectives, and fine-grained vision-language alignment. Regarding data scalability, we focus on vision-language representation learning that leverages unpaired images and texts. To enhance the implicit alignments between modalities and augment data diversity, we introduce cross-modal cutmix, a technique that blends visual patches with sentences to create multimodal sentences, i.e., multimodal views of a sentence. By incorporating diverse multimodal sentences into contrastive learning, the model exploits instance-level alignments between textual and multimodal samples. Our model circumvents the constraints of paired datasets, facilitating scalable multimodal representation learning with a broader and more varied collection of unpaired data. In terms of learning efficiency, we investigate how to accelerate vision-language pretraining. We empirically find that an essential obstacle to training efficiency is the entanglement of the prediction rate (percentage of tokens for reconstruction) and the corruption rate (percentage of corrupted tokens) in masked language modeling: a proper corruption rate is achieved at the cost of excluding a large portion of output tokens from the prediction loss. To overcome this limitation, we propose free language modeling (FLM), a new pretraining objective that decouples the prediction rate from the corruption rate. FLM achieves faster convergence by allowing customization of corruption spans for each token, while maintaining competitive performance on downstream vision-language tasks. Concerning cross-modal alignment granularity, we study fine-grained alignments between untrimmed videos and natural language. We propose a grounded vision-language learning (GVL) framework for untrimmed videos, focusing on detecting informative events and aligning multi-sentence descriptions with corresponding event segments. We introduce a parallel decoding paradigm for dense video captioning (PDVC) to segment videos effectively, enhancing the coherence and readability of generated dense captions. Furthermore, two dual pretext tasks are proposed to encourage fine-grained segment-level alignments: text-to-event contrast and event-to-text generation. The framework is versatile and applicable to visually-grounded language understanding and generation tasks. We conduct extensive experiments to validate the proposed methods. These efforts advance the frontiers of multimodal learning and pave the way for more efficient and effective integration of vision and language in machine intelligence systems.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Computer vision	-
dc.subject.lcsh	Machine learning	-
dc.title	Learning vision-language representation for multimodal understanding	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2024	-
dc.identifier.mmsid	991044861893103414	-
