- Appears in Collections:
postgraduate thesis: Deep learning for dense visual predictions
Title | Deep learning for dense visual predictions |
---|---|
Authors | Xie, Enze (谢恩泽) |
Advisors | Luo, P; Wang, WP |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Xie, E. [谢恩泽]. (2022). Deep learning for dense visual predictions. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Understanding the content of digital images is important to human life and has a wide range of real-world applications. Moreover, a deep understanding of each pixel in an image is essential for applications such as autonomous driving, face verification, and smart home robots. It is therefore crucial to develop accurate, efficient, and robust methods for dense visual prediction beyond visual classification, such as semantic segmentation, object detection, and instance segmentation. This thesis tackles three problems in dense visual prediction: polar representation for instance segmentation, self-supervised object detection, and robust semantic segmentation with Transformers.
The first part of this thesis addresses the problem of instance segmentation. Existing approaches often solve instance segmentation in "two stages": detecting bounding boxes in the first stage and performing pixel-level segmentation inside each box in the second, resulting in complicated designs and low efficiency.
Unlike previous methods, we solve the instance segmentation problem in polar coordinates and propose a deep learning framework, named PolarMask, that uses a polar representation to formulate an instance mask.
We define the center of gravity of each object as the origin of the polar coordinate system and emit a set of rays from the center to the contour at evenly spaced angles. During training, PolarMask learns the location of the object center and the length of each ray. During testing, a mask prediction is obtained by assembling the center and rays. PolarMask is a single-shot, anchor-free instance segmentation framework that is much faster than previous two-stage methods.
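The center-plus-rays decoding described above can be sketched as follows. This is a minimal illustration of the polar representation, not the thesis code; the function name and the 36-ray setting are assumptions made for the example.

```python
import math

def assemble_mask_contour(center, ray_lengths):
    """Decode a polar-style instance mask contour: from the object center,
    walk one ray per evenly spaced angle and place a contour vertex at the
    predicted ray length."""
    cx, cy = center
    n = len(ray_lengths)
    contour = []
    for k, r in enumerate(ray_lengths):
        theta = 2.0 * math.pi * k / n  # even angular spacing
        contour.append((cx + r * math.cos(theta),
                        cy + r * math.sin(theta)))
    return contour

# If every ray has the same length, the decoded mask is a circle:
# 36 rays of length 5 around the center (10, 10).
contour = assemble_mask_contour((10.0, 10.0), [5.0] * 36)
```

Joining the vertices in order and closing the polygon yields the instance mask, so training reduces to regressing the center location plus one length per ray.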
The second part of this thesis addresses the problem of self-supervised pre-training and representation learning for instance-level detection tasks. Unlike most recent methods, which focus on improving image classification accuracy, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between the global image and local image patches to learn discriminative representations for object detection. DetCo learns powerful, general feature representations from massive unlabeled image data, largely boosts downstream tasks such as object detection and multi-person pose estimation, and improves label efficiency.
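DetCo builds on the standard contrastive (InfoNCE) objective, applied here across global-image and local-patch pairs. A minimal sketch of that generic objective (the 0.07 temperature and cosine-similarity inputs are assumptions for illustration, not values taken from the thesis):

```python
import math

def info_nce(pos_sim, all_sims, temperature=0.07):
    """InfoNCE loss for one anchor: the negative log-softmax of the
    positive pair's similarity against all candidates (positive plus
    negatives)."""
    logits = [s / temperature for s in all_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / temperature - log_denom)

# An anchor matched with its positive (similarity 0.9) against two
# negatives; the loss shrinks as the positive similarity grows.
loss = info_nce(0.9, [0.9, 0.1, -0.2])
```

Minimizing this loss pulls matching views (e.g. a global image and its own patches) together in feature space while pushing other images away.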
The third part of this thesis considers the problem of efficient, strong, and robust semantic segmentation with an advanced Transformer-based architecture, termed SegFormer.
SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder that outputs multiscale features. It needs no positional encoding, thereby avoiding the interpolation of positional codes, which degrades performance when the testing resolution differs from the training resolution. 2) SegFormer avoids complex decoders: its lightweight All-MLP decoder directly fuses the multi-level features and predicts the semantic segmentation mask.
As a result, SegFormer sets a new state of the art in efficiency, accuracy, and robustness on several benchmarks.
We are also the first to verify that Transformers have a much larger effective receptive field than ConvNets, and the first to show the excellent zero-shot robustness of Transformers on out-of-distribution data. |
Degree | Doctor of Philosophy |
Subject | Deep learning (Machine learning); Digital images; Computer vision |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/322927 |
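The All-MLP decoding step described in the abstract (project each multiscale feature to a common dimension, upsample, concatenate, fuse, predict per-pixel classes) can be sketched as below. This only illustrates the data flow: the weights are random placeholders, and nearest-neighbor upsampling stands in for the bilinear upsampling a real implementation would use.

```python
import numpy as np

def all_mlp_decoder(features, out_hw, embed_dim=4, num_classes=3, rng=None):
    """Sketch of an All-MLP-style decoder: linearly project each multiscale
    feature map (h, w, c) to a shared embedding dim, upsample all levels to
    one resolution, concatenate, and predict per-pixel class logits."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = out_hw
    projected = []
    for f in features:
        h, w, c = f.shape
        W1 = rng.standard_normal((c, embed_dim))
        p = f @ W1  # per-pixel linear projection ("MLP" layer)
        # nearest-neighbor upsample to the common output resolution
        ys = np.arange(H) * h // H
        xs = np.arange(W) * w // W
        projected.append(p[ys][:, xs])
    fused = np.concatenate(projected, axis=-1)      # (H, W, levels*embed_dim)
    W2 = rng.standard_normal((fused.shape[-1], num_classes))
    return fused @ W2                               # (H, W, num_classes)

# Two toy encoder levels at different scales, decoded to a 16x16 mask.
feats = [np.ones((8, 8, 4)), np.ones((4, 4, 8))]
logits = all_mlp_decoder(feats, (16, 16))
```

The per-pixel argmax over the class dimension would then give the predicted segmentation mask; the hierarchical encoder itself is omitted here.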
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Luo, P | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Xie, Enze | - |
dc.contributor.author | 谢恩泽 | - |
dc.date.accessioned | 2022-11-18T10:41:50Z | - |
dc.date.available | 2022-11-18T10:41:50Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Xie, E. [谢恩泽]. (2022). Deep learning for dense visual predictions. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/322927 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Deep learning (Machine learning) | - |
dc.subject.lcsh | Digital images | - |
dc.subject.lcsh | Computer vision | - |
dc.title | Deep learning for dense visual predictions | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044609104903414 | - |