Appears in Collections: postgraduate thesis: Vision-language fusion and reasoning in visual grounding
Title | Vision-language fusion and reasoning in visual grounding |
---|---|
Authors | Yang, Sibei [杨思蓓] |
Advisors | Yu, Y |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yang, S. [杨思蓓]. (2020). Vision-language fusion and reasoning in visual grounding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Understanding human cognition through high-level semantics is the central challenge in the interaction between vision and language. I believe that natural language involves a cognitive understanding of the world, which is more complex than visual perception. Understanding visual content, natural language, and the relationship between them is therefore critical for exploring human cognition beyond perception.
In this thesis, I focus on vision-language fusion and reasoning, and work on visual grounding. Visual grounding aims to locate the referent, i.e., the visual region in an image referred to by a natural language expression. The challenge is that the expression is generated in unconstrained scenes and typically describes not only the appearance of the referent but also its relationships to other objects. I address this challenge from two perspectives: relationship-embedded cross-modal fusion and language-driven visual reasoning.
The perspective of relationship-embedded cross-modal fusion addresses visual grounding with two proposed models: the cross-modal relationship inference network (CMRIN) and the one-stage relational propagation network (OSRPN). Extracting and modeling relationships among objects is essential for visual grounding. Moreover, multi-order relationships can be captured explicitly by graph-based information propagation, which also helps fuse information across modalities. CMRIN therefore constructs a language-guided visual relation graph with cross-modal attention and captures relationship-embedded contexts; it achieves new state-of-the-art results on all the benchmark datasets. Unlike two-stage frameworks such as CMRIN, one-stage grounding has no explicit object-level information. OSRPN therefore models relationships implicitly, by associating objects with the nodes of a linguistic graph parsed from the sentence and performing relational propagation over that graph, and it also outperforms state-of-the-art methods.
From the perspective of language-driven visual reasoning, I propose the dynamic graph attention network (DGA) and the scene graph guided modular network (SGMN). The motivations are threefold: (1) visual grounding is compositional and inherently requires visual reasoning over the relationships among objects in the image; (2) human visual reasoning for grounding is guided by the linguistic structure of the referring expression; and (3) explicit alignment between linguistic components and visual content yields a reliable and interpretable reasoning process. DGA performs explicit multi-step reasoning, identifying compound objects step by step under the guidance of the learned linguistic structure of the expression, and can show visual evidence while stepwise locating the objects referred to by complex language descriptions. DGA uses self-attention over the expression to explore its linguistic structure and learns feature representations for compound objects, whereas SGMN parses the expression into a structured language scene graph and explicitly aligns linguistic components with visual content by performing graph-structured reasoning over scene graphs with neural modules. SGMN not only achieves new state-of-the-art results on complex expressions but also generates interpretable and visualizable intermediate steps in the reasoning process via a graph attention mechanism.
In addition, a large-scale real-world dataset named Ref-Reasoning is developed for reasoning-oriented visual grounding. It contains real-world visual content and semantically rich expressions with diverse reasoning layouts. |
Degree | Doctor of Philosophy |
Subject | Natural language processing (Computer science); Machine learning - Evaluation; Image processing - Data processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/290443 |
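The graph-based, language-guided propagation that the abstract attributes to CMRIN can be illustrated with a minimal sketch. Everything below (the function name, the edge-scoring rule, the single propagation round) is an illustrative assumption for exposition, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_propagation(obj_feats, edges, lang_feat):
    """One round of message passing over a visual relation graph.

    obj_feats: (N, d) object features; edges: directed (i, j) pairs;
    lang_feat: (d,) sentence embedding. Each edge is weighted by how well
    its target aligns with the language (a toy form of cross-modal
    attention), so language-relevant neighbours contribute more context.
    """
    n, _ = obj_feats.shape
    scores = np.zeros((n, n))
    for i, j in edges:
        scores[i, j] = obj_feats[j] @ lang_feat  # cross-modal edge score
    out = obj_feats.astype(float).copy()
    for i in range(n):
        nbrs = [j for (a, j) in edges if a == i]
        if not nbrs:
            continue  # isolated node: keep its own features
        w = softmax(scores[i, nbrs])
        out[i] = out[i] + w @ obj_feats[nbrs]  # aggregate weighted context
    return out
```

A multi-order variant would simply repeat the propagation step, letting context flow along longer relation chains.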
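Similarly, the stepwise, expression-guided reasoning the abstract ascribes to DGA and SGMN can be sketched as attention over objects updated one linguistic component at a time. Again, the names and the way prior attention is carried forward are simplifying assumptions, not the published models.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def stepwise_grounding(obj_feats, step_feats):
    """Multi-step grounding: one attention update per linguistic component.

    obj_feats: (N, d) object features; step_feats: (T, d) ordered phrase
    embeddings (e.g. 'cup' then 'on the table'). Each step re-scores the
    objects against the current phrase while carrying over the previous
    step's attention as a log-space bias, so evidence accumulates along
    the expression; the per-step attentions form an interpretable trace.
    """
    n = obj_feats.shape[0]
    attn = np.full(n, 1.0 / n)                        # start uniform
    history = []
    for q in step_feats:
        scores = obj_feats @ q + np.log(attn + 1e-8)  # prior as log bias
        attn = softmax(scores)
        history.append(attn)
    return attn, history
```

The final attention picks the referent; the history provides the visualizable intermediate steps the abstract mentions.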
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Yu, Y | - |
dc.contributor.author | Yang, Sibei | - |
dc.contributor.author | 杨思蓓 | - |
dc.date.accessioned | 2020-11-02T01:56:17Z | - |
dc.date.available | 2020-11-02T01:56:17Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Yang, S. [杨思蓓]. (2020). Vision-language fusion and reasoning in visual grounding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/290443 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Natural language processing (Computer science) | - |
dc.subject.lcsh | Machine learning - Evaluation | - |
dc.subject.lcsh | Image processing - Data processing | - |
dc.title | Vision-language fusion and reasoning in visual grounding | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044291216103414 | - |