
Postgraduate thesis: Vision-language fusion and reasoning in visual grounding

Title: Vision-language fusion and reasoning in visual grounding
Authors: Yang, Sibei [杨思蓓]
Advisor(s): Yu, Y
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yang, S. [杨思蓓]. (2020). Vision-language fusion and reasoning in visual grounding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Understanding human cognition at the level of high-level semantics is the main challenge in the interaction between vision and language. I believe that natural language involves a cognitive understanding of the world, which is more complicated than visual perception. Understanding visual content, natural language, and the relationship between them is therefore critical for exploring human cognition beyond perception. In this thesis, I focus on vision-language fusion and reasoning, and work on visual grounding. Visual grounding aims to locate the referent, i.e., the visual region in an image that a natural language expression refers to. The challenge is that the expression is generated for unconstrained scenes and normally describes not only the appearance of the referent but also its relationships to other objects. I address this challenge from two perspectives: relationship-embedded cross-modal fusion and language-driven visual reasoning.

From the perspective of relationship-embedded cross-modal fusion, visual grounding is addressed with the proposed cross-modal relationship inference network (CMRIN) and one-stage relational propagation network (OSRPN). Extracting and modeling relationships among objects are essential for visual grounding, and multi-order relationships can be captured explicitly by graph-based information propagation, which also helps fuse information across modalities. CMRIN therefore constructs a language-guided visual relation graph with cross-modal attention and captures relationship-embedded contexts; it achieves new state-of-the-art results on all the benchmark datasets. Unlike two-stage frameworks (e.g., CMRIN), one-stage grounding has no explicit object-level information. OSRPN therefore models relationships implicitly by associating objects with the nodes of a linguistic graph parsed from the sentence and performing relational propagation over that graph; it outperforms state-of-the-art methods.

From the perspective of language-driven visual reasoning, I propose the dynamic graph attention network (DGA) and the scene graph guided modular network (SGMN). The motivation is threefold: (1) visual grounding is compositional and inherently requires visual reasoning on top of the relationships among objects in the image; (2) human visual reasoning for grounding is guided by the linguistic structure of the referring expression; and (3) explicit alignment between linguistic components and visual contents provides a reliable and interpretable reasoning process. DGA performs explicit multi-step reasoning by identifying compound objects step by step, guided by the learned linguistic structure of the expression, and can show visual evidence for stepwise locating of the objects referred to in complex language descriptions. DGA relies on self-attention over the expression to explore its linguistic structure and learns feature representations for compound objects, whereas SGMN parses the expression into a structured language scene graph and explicitly aligns linguistic components with visual contents by performing graph-structured reasoning over scene graphs with neural modules. SGMN not only achieves new state-of-the-art results on complex expressions but also explicitly generates interpretable and visualizable intermediate reasoning steps via a graph attention mechanism.

In addition, a large-scale real-world dataset named Ref-Reasoning is developed for reasoning-based visual grounding. It contains real-world visual contents and semantically rich expressions with different reasoning layouts.
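To make the shared idea behind these methods concrete, the following is a minimal, hypothetical PyTorch sketch, not code from the thesis; all module names, dimensions, and layers (e.g., LanguageGuidedGraphPropagation, edge_score) are illustrative assumptions. It shows language-guided graph propagation in the spirit described above: each pair of detected objects is scored jointly with the expression (cross-modal attention), the resulting weights form a relation graph, and repeated propagation embeds multi-order relationships into each object's context before matching it against the expression.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedGraphPropagation(nn.Module):
    """Illustrative sketch: language-guided edge weighting + graph propagation."""
    def __init__(self, obj_dim=2048, lang_dim=1024, hid_dim=512, steps=2):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hid_dim)    # project object (region) features
        self.lang_proj = nn.Linear(lang_dim, hid_dim)  # project expression embedding
        self.edge_score = nn.Linear(3 * hid_dim, 1)    # cross-modal score for each edge
        self.message = nn.Linear(hid_dim, hid_dim)     # transform propagated messages
        self.steps = steps                             # propagation steps = relation order

    def forward(self, obj_feats, lang_feat):
        # obj_feats: (N, obj_dim) features of N detected objects
        # lang_feat: (lang_dim,) pooled embedding of the referring expression
        x = torch.relu(self.obj_proj(obj_feats))             # (N, H)
        q = torch.relu(self.lang_proj(lang_feat))            # (H,)
        n = x.size(0)

        # Cross-modal attention over edges: every ordered pair (i, j) of objects
        # is scored together with the expression, so the language decides which
        # relations in the visual relation graph matter.
        xi = x.unsqueeze(1).expand(n, n, -1)                  # sender features
        xj = x.unsqueeze(0).expand(n, n, -1)                  # receiver features
        qe = q.view(1, 1, -1).expand(n, n, -1)                # language context
        logits = self.edge_score(torch.cat([xi, xj, qe], dim=-1)).squeeze(-1)
        adj = F.softmax(logits, dim=0)                        # (N, N) edge weights

        # Graph-based information propagation: repeating the update lets
        # multi-order (multi-hop) relationships flow into each object's context.
        for _ in range(self.steps):
            x = x + torch.relu(self.message(adj.t() @ x))     # residual update

        # Match each relationship-embedded object context with the expression.
        scores = (F.normalize(x, dim=-1) * F.normalize(q, dim=0)).sum(-1)
        return scores                                         # (N,) grounding scores

# Example usage with random features (for shape checking only):
model = LanguageGuidedGraphPropagation()
region_feats = torch.randn(8, 2048)    # e.g., 8 detected regions
expr_feat = torch.randn(1024)          # pooled expression embedding (e.g., from an LSTM)
referent = model(region_feats, expr_feat).argmax()  # index of the predicted referent

In the actual systems the relation graph, the propagation rules, and the matching are considerably richer (e.g., spatial relations, word-level attention, and modular reasoning over a parsed scene graph), but this propagate-then-match structure is the common thread.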
Degree: Doctor of Philosophy
Subject: Natural language processing (Computer science)
Machine learning - Evaluation
Image processing - Data processing
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/290443

 

DC Field: Value
dc.contributor.advisor: Yu, Y
dc.contributor.author: Yang, Sibei
dc.contributor.author: 杨思蓓
dc.date.accessioned: 2020-11-02T01:56:17Z
dc.date.available: 2020-11-02T01:56:17Z
dc.date.issued: 2020
dc.identifier.citation: Yang, S. [杨思蓓]. (2020). Vision-language fusion and reasoning in visual grounding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/290443
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Natural language processing (Computer science)
dc.subject.lcsh: Machine learning - Evaluation
dc.subject.lcsh: Image processing - Data processing
dc.title: Vision-language fusion and reasoning in visual grounding
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2020
dc.identifier.mmsid: 991044291216103414
