Appears in Collections: postgraduate thesis: Vision-language fusion and reasoning in visual grounding
Title | Vision-language fusion and reasoning in visual grounding |
---|---|
Authors | Yang, Sibei [杨思蓓] |
Advisors | Yu, Y |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yang, S. [杨思蓓]. (2020). Vision-language fusion and reasoning in visual grounding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Understanding human cognition through high-level semantics is the central challenge in the interaction between vision and language. I believe that natural language involves a cognitive understanding of the world, which is more complex than visual perception. Understanding visual content, natural language, and the relationship between them is therefore critical for exploring human cognition beyond perception.
In this thesis, I focus on vision-language fusion and reasoning, and work on visual grounding. Visual grounding aims to locate the referent, i.e., the visual region in an image referred to by a natural language expression. The challenge is that the expression is generated in unconstrained scenes and typically describes not only the appearance of the referent but also its relationships to other objects. I address this challenge from two perspectives: relationship-embedded cross-modal fusion and language-driven visual reasoning.
The perspective of relationship-embedded cross-modal fusion addresses visual grounding with two proposed models: the cross-modal relationship inference network (CMRIN) and the one-stage relational propagation network (OSRPN). Extracting and modeling relationships among objects is essential for visual grounding. Moreover, multi-order relationships can be captured explicitly by graph-based information propagation, which also helps fuse information across modalities. CMRIN therefore constructs a language-guided visual relation graph with cross-modal attention and captures relationship-embedded contexts; it achieves new state-of-the-art results on all the benchmark datasets. Unlike two-stage frameworks such as CMRIN, one-stage grounding has no explicit object-level information. OSRPN therefore models relationships implicitly, by associating objects with the nodes of a linguistic graph parsed from the sentence and performing relational propagation over that graph, and it also outperforms state-of-the-art methods.
From the perspective of language-driven visual reasoning, I propose the dynamic graph attention network (DGA) and the scene graph guided modular network (SGMN). The motivations are threefold: (1) visual grounding is compositional and inherently requires visual reasoning over the relationships among objects in the image; (2) human visual reasoning for grounding is guided by the linguistic structure of the referring expression; and (3) explicit alignment between linguistic components and visual content yields a reliable and interpretable reasoning process. DGA performs explicit multi-step reasoning, identifying compound objects step by step under the guidance of the learned linguistic structure of the expression, and can show visual evidence while stepwise locating the objects referred to by complex language descriptions. DGA uses self-attention over the expression to explore its linguistic structure and learns feature representations for compound objects, whereas SGMN parses the expression into a structured language scene graph and explicitly aligns linguistic components with visual content by performing graph-structured reasoning over scene graphs with neural modules. SGMN not only achieves new state-of-the-art results on complex expressions but also generates interpretable and visualizable intermediate steps in the reasoning process via a graph attention mechanism.
In addition, a large-scale real-world dataset named Ref-Reasoning is developed for reasoning-oriented visual grounding. It contains real-world visual content and semantically rich expressions with diverse reasoning layouts. |
Degree | Doctor of Philosophy |
Subject | Natural language processing (Computer science); Machine learning - Evaluation; Image processing - Data processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/290443 |
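The graph-based, language-guided propagation that the abstract attributes to CMRIN can be illustrated with a minimal sketch. Everything below (the function name, the edge-scoring rule, the single propagation round) is an illustrative assumption for exposition, not the thesis's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_propagation(obj_feats, edges, lang_feat):
    """One round of message passing over a visual relation graph.

    obj_feats: (N, d) object features; edges: directed (i, j) pairs;
    lang_feat: (d,) sentence embedding. Each edge is weighted by how well
    its target aligns with the language (a toy form of cross-modal
    attention), so language-relevant neighbours contribute more context.
    """
    n, _ = obj_feats.shape
    scores = np.zeros((n, n))
    for i, j in edges:
        scores[i, j] = obj_feats[j] @ lang_feat  # cross-modal edge score
    out = obj_feats.astype(float).copy()
    for i in range(n):
        nbrs = [j for (a, j) in edges if a == i]
        if not nbrs:
            continue  # isolated node: keep its own features
        w = softmax(scores[i, nbrs])
        out[i] = out[i] + w @ obj_feats[nbrs]  # aggregate weighted context
    return out
```

A multi-order variant would simply repeat the propagation step, letting context flow along longer relation chains.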
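Similarly, the stepwise, expression-guided reasoning the abstract ascribes to DGA and SGMN can be sketched as attention over objects updated one linguistic component at a time. Again, the names and the way prior attention is carried forward are simplifying assumptions, not the published models.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def stepwise_grounding(obj_feats, step_feats):
    """Multi-step grounding: one attention update per linguistic component.

    obj_feats: (N, d) object features; step_feats: (T, d) ordered phrase
    embeddings (e.g. 'cup' then 'on the table'). Each step re-scores the
    objects against the current phrase while carrying over the previous
    step's attention as a log-space bias, so evidence accumulates along
    the expression; the per-step attentions form an interpretable trace.
    """
    n = obj_feats.shape[0]
    attn = np.full(n, 1.0 / n)                        # start uniform
    history = []
    for q in step_feats:
        scores = obj_feats @ q + np.log(attn + 1e-8)  # prior as log bias
        attn = softmax(scores)
        history.append(attn)
    return attn, history
```

The final attention picks the referent; the history provides the visualizable intermediate steps the abstract mentions.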
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Yu, Y | - |
dc.contributor.author | Yang, Sibei | - |
dc.contributor.author | 杨思蓓 | - |
dc.date.accessioned | 2020-11-02T01:56:17Z | - |
dc.date.available | 2020-11-02T01:56:17Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Yang, S. [杨思蓓]. (2020). Vision-language fusion and reasoning in visual grounding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/290443 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Natural language processing (Computer science) | - |
dc.subject.lcsh | Machine learning - Evaluation | - |
dc.subject.lcsh | Image processing - Data processing | - |
dc.title | Vision-language fusion and reasoning in visual grounding | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044291216103414 | - |