- Appears in Collections:
postgraduate thesis: Deep learning for dense visual predictions
Title | Deep learning for dense visual predictions |
---|---|
Authors | Xie, Enze (谢恩泽) |
Advisors | Luo, P; Wang, WP |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Xie, E. [谢恩泽]. (2022). Deep learning for dense visual predictions. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Understanding the content of digital images is important to human life and has a wide range of real-world applications. Moreover, a deep understanding of each pixel in an image is essential for applications such as autonomous driving, face verification, and smart home robots. It is therefore crucial to develop accurate, efficient, and robust methods for dense visual prediction beyond visual classification, such as semantic segmentation, object detection, and instance segmentation. This thesis tackles three problems in dense visual prediction: polar representation for instance segmentation, self-supervised object detection, and robust semantic segmentation with Transformers.
The first part of this thesis addresses the problem of instance segmentation. Existing approaches often solve instance segmentation in "two stages": detecting bounding boxes in the first stage and performing pixel-level segmentation inside each box in the second, resulting in complicated designs and low efficiency.
Unlike previous methods, we solve the instance segmentation problem in polar coordinates and propose a deep learning framework, named PolarMask, that uses a polar representation to formulate an instance mask.
We define the center of gravity of each object as the origin of the polar coordinate system and emit a set of rays from the center to the contour at evenly spaced angles. During training, PolarMask learns the location of the object center and the length of each ray. During testing, a mask prediction is obtained by assembling the center and rays. PolarMask is a single-shot, anchor-free instance segmentation framework that is much faster than previous two-stage methods.
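The center-plus-rays decoding described above can be sketched as follows. This is a minimal illustration of the polar representation, not the thesis code; the function name and the 36-ray setting are assumptions made for the example.

```python
import math

def assemble_mask_contour(center, ray_lengths):
    """Decode a polar-style instance mask contour: from the object center,
    walk one ray per evenly spaced angle and place a contour vertex at the
    predicted ray length."""
    cx, cy = center
    n = len(ray_lengths)
    contour = []
    for k, r in enumerate(ray_lengths):
        theta = 2.0 * math.pi * k / n  # even angular spacing
        contour.append((cx + r * math.cos(theta),
                        cy + r * math.sin(theta)))
    return contour

# If every ray has the same length, the decoded mask is a circle:
# 36 rays of length 5 around the center (10, 10).
contour = assemble_mask_contour((10.0, 10.0), [5.0] * 36)
```

Joining the vertices in order and closing the polygon yields the instance mask, so training reduces to regressing the center location plus one length per ray.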
The second part of this thesis addresses the problem of self-supervised pre-training and representation learning for instance-level detection tasks. Unlike most recent methods, which focus on improving image classification accuracy, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between the global image and local image patches to learn discriminative representations for object detection. DetCo learns powerful, general feature representations from massive unlabeled image data, largely boosts downstream tasks such as object detection and multi-person pose estimation, and improves label efficiency.
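DetCo builds on the standard contrastive (InfoNCE) objective, applied here across global-image and local-patch pairs. A minimal sketch of that generic objective (the 0.07 temperature and cosine-similarity inputs are assumptions for illustration, not values taken from the thesis):

```python
import math

def info_nce(pos_sim, all_sims, temperature=0.07):
    """InfoNCE loss for one anchor: the negative log-softmax of the
    positive pair's similarity against all candidates (positive plus
    negatives)."""
    logits = [s / temperature for s in all_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / temperature - log_denom)

# An anchor matched with its positive (similarity 0.9) against two
# negatives; the loss shrinks as the positive similarity grows.
loss = info_nce(0.9, [0.9, 0.1, -0.2])
```

Minimizing this loss pulls matching views (e.g. a global image and its own patches) together in feature space while pushing other images away.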
The third part of this thesis considers the problem of efficient, strong, and robust semantic segmentation with an advanced Transformer-based architecture, termed SegFormer.
SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder that outputs multiscale features. It needs no positional encoding, thereby avoiding the interpolation of positional codes, which degrades performance when the testing resolution differs from the training resolution. 2) SegFormer avoids complex decoders: its lightweight All-MLP decoder directly fuses the multi-level features and predicts the semantic segmentation mask.
As a result, SegFormer sets a new state of the art in efficiency, accuracy, and robustness on several benchmarks.
We are also the first to verify that Transformers have a much larger effective receptive field than ConvNets, and the first to show the excellent zero-shot robustness of Transformers on out-of-distribution data. |
Degree | Doctor of Philosophy |
Subject | Deep learning (Machine learning); Digital images; Computer vision |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/322927 |
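The All-MLP decoding step described in the abstract (project each multiscale feature to a common dimension, upsample, concatenate, fuse, predict per-pixel classes) can be sketched as below. This only illustrates the data flow: the weights are random placeholders, and nearest-neighbor upsampling stands in for the bilinear upsampling a real implementation would use.

```python
import numpy as np

def all_mlp_decoder(features, out_hw, embed_dim=4, num_classes=3, rng=None):
    """Sketch of an All-MLP-style decoder: linearly project each multiscale
    feature map (h, w, c) to a shared embedding dim, upsample all levels to
    one resolution, concatenate, and predict per-pixel class logits."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = out_hw
    projected = []
    for f in features:
        h, w, c = f.shape
        W1 = rng.standard_normal((c, embed_dim))
        p = f @ W1  # per-pixel linear projection ("MLP" layer)
        # nearest-neighbor upsample to the common output resolution
        ys = np.arange(H) * h // H
        xs = np.arange(W) * w // W
        projected.append(p[ys][:, xs])
    fused = np.concatenate(projected, axis=-1)      # (H, W, levels*embed_dim)
    W2 = rng.standard_normal((fused.shape[-1], num_classes))
    return fused @ W2                               # (H, W, num_classes)

# Two toy encoder levels at different scales, decoded to a 16x16 mask.
feats = [np.ones((8, 8, 4)), np.ones((4, 4, 8))]
logits = all_mlp_decoder(feats, (16, 16))
```

The per-pixel argmax over the class dimension would then give the predicted segmentation mask; the hierarchical encoder itself is omitted here.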
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Luo, P | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Xie, Enze | - |
dc.contributor.author | 谢恩泽 | - |
dc.date.accessioned | 2022-11-18T10:41:50Z | - |
dc.date.available | 2022-11-18T10:41:50Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Xie, E. [谢恩泽]. (2022). Deep learning for dense visual predictions. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/322927 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Deep learning (Machine learning) | - |
dc.subject.lcsh | Digital images | - |
dc.subject.lcsh | Computer vision | - |
dc.title | Deep learning for dense visual predictions | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044609104903414 | - |