Title | Deep learning for visual retrieval, visual grounding and visual reasoning |
---|---|
Authors | Chen, Zhenfang (陳振方) |
Advisors | Wong, KKY |
Issue Date | 2021 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | 陳振方, [Chen, Zhenfang]. (2021). Deep learning for visual retrieval, visual grounding and visual reasoning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | This thesis applies deep learning representations to challenging computer vision applications, including single-modal visual object retrieval, cross-modal video grounding, and visual reasoning over vision and language. With powerful deep learning representations and carefully designed models, this thesis aims to develop machines that better perceive and reason about the rich visual world.
In the first part of this thesis, we propose a novel approach to the problem of single-modal visual object retrieval, which aims to retrieve gallery images containing the same objects as a query image. We apply a correlation layer to locally-aggregated deep features and compute a local similarity that not only handles small objects but also captures spatial relations between the query and gallery images. We also reduce the memory and storage footprints by quantizing the local features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our method.
We then turn to the problem of video grounding, which requires a model to localize a spatio-temporal tube in a video that semantically matches a language query. During training, we only have access to sentence-video pairs without any spatio-temporal annotations. We first link frame-level object proposals into spatio-temporal proposals, and then propose an attentive interactor that exploits the complex relationships between the spatio-temporal proposals and the given language to yield their matching scores. We further introduce a novel diversity loss that strengthens the matching behaviors of reliable instance-sentence pairs and penalizes unreliable ones. Extensive experimental results demonstrate the superiority of our model.
The third part of the thesis addresses the problem of visual reasoning over images and language. To evaluate models' reasoning ability, we introduce a new dataset and task named compositional referring expression comprehension (Cops-Ref), which requires a model to localize the referent within a set of similar images. To build the benchmark, we first design a novel expression engine that renders various reasoning logic forms and generates expressions with varying compositionality. We then add distracting images containing objects that share similar properties with the referent, thus minimizing the success rate of reasoning-free alignment. Only models that fully understand the logic and relations embodied in the expression and can distinguish subtle visual differences achieve high performance on Cops-Ref. We propose a modular hard-mining model, which performs best among all the baselines.
We finally study the problem of dynamic visual reasoning and present the Dynamic Concept Learner (DCL), which learns physical object and event concepts from video question answering. DCL represents objects as latent feature vectors and approximates their dynamic interactions using graph networks. It parses each question into a semantic program and executes the program to answer the question. DCL can ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations to answer queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging video reasoning dataset, without using attribute or collision labels during training. It can also be extended to video retrieval and event localization tasks. |
Degree | Doctor of Philosophy |
Subject | Machine learning; Computer vision |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/302569 |
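
The retrieval approach summarized in the abstract hinges on correlating locally-aggregated deep features to obtain a local similarity between a query and each gallery image. Below is a minimal illustrative sketch of that idea, assuming L2-normalized local descriptors and a simple max-then-mean aggregation; the function name, feature shapes, and aggregation rule are assumptions made here for illustration rather than the thesis's actual implementation, and the feature quantization step is omitted.

```python
# Illustrative sketch (not the thesis's exact method): scoring a gallery image
# against a query by correlating L2-normalized local deep features.
import numpy as np

def local_similarity(query_feats: np.ndarray, gallery_feats: np.ndarray) -> float:
    """query_feats: (Nq, D) local descriptors from the query image.
    gallery_feats: (Ng, D) local descriptors from a gallery image.
    Returns a scalar similarity score."""
    # L2-normalize descriptors so the correlation equals cosine similarity.
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    g = gallery_feats / (np.linalg.norm(gallery_feats, axis=1, keepdims=True) + 1e-8)
    corr = q @ g.T                  # (Nq, Ng) correlation map between all locations
    best = corr.max(axis=1)         # best-matching gallery location per query location
    return float(best.mean())       # average over query locations

# Toy usage with random descriptors (49 query locations, 196 gallery locations).
rng = np.random.default_rng(0)
score = local_similarity(rng.normal(size=(49, 128)), rng.normal(size=(196, 128)))
print(round(score, 3))
```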
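The video grounding part mentions an attentive interactor that scores spatio-temporal proposals against a sentence, together with a diversity loss over those scores. The snippet below sketches one plausible instantiation of such a loss, an entropy penalty that sharpens the score distribution so one reliable proposal-sentence match dominates; the exact formulation in the thesis may differ, and the function name and toy scores are assumptions.

```python
# Illustrative sketch (one plausible instantiation, not necessarily the thesis's
# exact loss): sharpening the distribution of proposal-sentence matching scores
# by penalizing its entropy, so a single reliable proposal stands out.
import torch
import torch.nn.functional as F

def diversity_loss(matching_scores: torch.Tensor) -> torch.Tensor:
    """matching_scores: (K,) raw scores of K spatio-temporal proposals for one sentence."""
    p = F.softmax(matching_scores, dim=0)        # normalized matching distribution
    entropy = -(p * torch.log(p + 1e-8)).sum()   # high when scores are spread out
    return entropy                               # minimizing it sharpens the distribution

scores = torch.tensor([0.2, 1.5, 0.3, 0.1], requires_grad=True)
loss = diversity_loss(scores)
loss.backward()                                  # gradients push the top proposal higher
print(float(loss))
```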
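The final part describes DCL parsing a question into a semantic program and executing it over extracted object and event representations. The toy executor below illustrates that execution pattern on hand-crafted symbolic objects and collision predictions; the operation names, program, and data structures are hypothetical simplifications, not DCL's actual interface.

```python
# Toy sketch of program execution over extracted video representations, in the
# spirit of (but much simpler than) DCL: a parsed question becomes a sequence of
# symbolic operations run over per-object concept predictions.
from dataclasses import dataclass

@dataclass
class Obj:
    oid: int
    color: str
    shape: str

objects = [Obj(0, "red", "cube"), Obj(1, "blue", "sphere"), Obj(2, "red", "sphere")]
collisions = {(0, 2)}  # pairs of object ids predicted to collide

# Hypothetical program for "What shape is the red object that collides with the cube?"
program = [("filter_color", "red"), ("filter_collides_with_shape", "cube"), ("query_shape", None)]

def execute(program, objs):
    state = list(objs)
    for op, arg in program:
        if op == "filter_color":
            state = [o for o in state if o.color == arg]
        elif op == "filter_collides_with_shape":
            partners = {o.oid for o in objs if o.shape == arg}
            state = [o for o in state
                     if any((o.oid, p) in collisions or (p, o.oid) in collisions
                            for p in partners)]
        elif op == "query_shape":
            return state[0].shape if state else None
    return state

print(execute(program, objects))  # -> "sphere"
```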
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wong, KKY | - |
dc.contributor.author | 陳振方 | - |
dc.contributor.author | Chen, Zhenfang | - |
dc.date.accessioned | 2021-09-07T03:41:28Z | - |
dc.date.available | 2021-09-07T03:41:28Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | 陳振方, [Chen, Zhenfang]. (2021). Deep learning for visual retrieval, visual grounding and visual reasoning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/302569 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Machine learning | - |
dc.subject.lcsh | Computer vision | - |
dc.title | Deep learning for visual retrieval, visual grounding and visual reasoning | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2021 | - |
dc.identifier.mmsid | 991044410246503414 | - |