postgraduate thesis: Learning data consistency for image geometry and text under indirect supervision
Title | Learning data consistency for image geometry and text under indirect supervision |
---|---|
Authors | Chen, Nenglun (陳能侖) |
Advisors | Luo, P; Wang, WP |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Chen, N. [陳能侖]. (2022). Learning data consistency for image geometry and text under indirect supervision. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Acquiring high-quality labeled data for training deep neural networks is usually expensive and time-consuming. It is therefore important to enable neural networks to learn from data without direct supervision. In general, there exist various task configurations, including self-supervised learning, weakly supervised learning, and semi-supervised learning. Though the types of supervision used in different tasks vary a lot, learning data consistency is usually an essential problem in scenarios where direct supervision is not available. Due to the lack of direct supervision, researchers in this field usually leverage various kinds of task-specific priors in the training stage, and the priors can be either manually designed or learned from a collection of related data. Moreover, for different data modalities, the priors to be used for learning data consistency can be quite different. Learning data consistency jointly across multiple data modalities is also valuable. In this thesis, we mainly focus on learning semantic consistency from a large collection of data and making the output of the neural networks semantically consistent across different data samples or modalities. Specifically, three problems are studied: learning 3D shape consistency, learning image-text consistency for image captioning, and learning image representations with geometric set consistency.
In the first part, a self-supervised learning method is proposed to learn structure points for 3D shapes in the form of point clouds. The learned structure points are consistent across different shapes with similar geometric structures. The proposed method is simple and quite effective. The only constraint used for training the neural network is the Chamfer distance loss, which enforces the produced structure points to be uniformly distributed on the input point clouds. Extensive experiments on several downstream tasks, including 3D semantic correspondence, example-based label transfer, and PCA-based shape embedding, demonstrate the effectiveness of the proposed method.
For the problem of image-text consistency in the scenario of image captioning, a simple yet effective neural network architecture is proposed. Specifically, the proposed architecture is built upon a Long Short-Term Memory (LSTM) based framework, and a distributed attention mechanism is designed to make the neural network attend to regions that have consistent semantics but different spatial positions when producing the words. In this way, the partial grounding issue can be effectively alleviated. Qualitative and quantitative experiments verify the superiority of the proposed method.
Finally, a novel contrastive-learning-based algorithm is presented for self-supervised learning of 2D image representations. Specifically, 3D geometric set consistency priors are used as strong cues to constrain the learned 2D image representations to be consistent within image views. The InfoNCE loss is adapted accordingly to enforce set-level consistency. The learned image representations are general and can be used to improve the performance of several 2D image-based indoor scene understanding tasks. Extensive experiments demonstrate the superior performance of our method compared with state-of-the-art methods. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Image processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/323175 |
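
The abstract names the Chamfer distance as the only training constraint for the structure points. As a rough illustration of what that quantity measures, the sketch below computes a common symmetric form of it in plain Python. The function names, the use of squared Euclidean distance, and the averaging over each set are assumptions made for this example; the thesis may use a different variant.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed variant: mean squared nearest-neighbour distance in each
    direction, summed. Other conventions (unsquared, summed) also exist.
    """
    def one_sided(src, dst):
        # For every point in src, find its nearest neighbour in dst,
        # then average those squared distances over src.
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


# Hypothetical toy data: a two-point "cloud" and one candidate structure point.
cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_points = [(1.0, 0.0, 0.0)]
print(chamfer_distance(structure_points, cloud))
```

The term from the cloud to the structure points penalizes regions of the input that no structure point covers, which is one way to read the abstract's remark that the loss spreads the points over the input point cloud.
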
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Luo, P | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Chen, Nenglun | - |
dc.contributor.author | 陳能侖 | - |
dc.date.accessioned | 2022-11-23T10:25:16Z | - |
dc.date.available | 2022-11-23T10:25:16Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Chen, N. [陳能侖]. (2022). Learning data consistency for image geometry and text under indirect supervision. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/323175 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Image processing | - |
dc.title | Learning data consistency for image geometry and text under indirect supervision | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044609100103414 | - |