Conference Paper: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Title | Pyramid vision transformer: A versatile backbone for dense prediction without convolutions |
---|---|
Authors | Wang, W; Xie, E; Li, X; Fan, DP; Song, K; Liang, D; Lu, T; Luo, P; Shao, L |
Issue Date | 2021 |
Publisher | IEEE Computer Society. |
Citation | 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, October 10-17, 2021. In Proceedings: 2021 IEEE/CVF International Conference on Computer Vision: ICCV 2021, 11-17 October 2021, Virtual event, p. 548-558 |
Abstract | Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently-proposed Vision Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to the current state of the art. (1) Different from ViT that typically yields low-resolution outputs and incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computations of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research. |
Persistent Identifier | http://hdl.handle.net/10722/315684 |
ISI Accession Number ID | WOS:000797698900055 |
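The abstract's "progressive shrinking pyramid" refers to PVT producing feature maps at progressively coarser resolutions, like a CNN backbone. A minimal sketch of the resulting resolution schedule, assuming the four stage strides (4, 2, 2, 2) described in the paper; the helper name `pyramid_shapes` is hypothetical:

```python
def pyramid_shapes(h, w, strides=(4, 2, 2, 2)):
    """Return the (H, W) of each stage's output feature map.

    Each stage's patch embedding divides the spatial resolution by its
    stride, yielding the {1/4, 1/8, 1/16, 1/32} pyramid that dense
    prediction heads (e.g. RetinaNet's FPN) expect from a CNN backbone.
    """
    shapes = []
    for s in strides:
        h, w = h // s, w // s
        shapes.append((h, w))
    return shapes

print(pyramid_shapes(224, 224))
# → [(56, 56), (28, 28), (14, 14), (7, 7)]
```

This matching of output strides is what lets PVT serve as a drop-in replacement for a ResNet backbone in detectors and segmenters without changing the downstream heads.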
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wang, W | - |
dc.contributor.author | Xie, E | - |
dc.contributor.author | Li, X | - |
dc.contributor.author | Fan, DP | - |
dc.contributor.author | Song, K | - |
dc.contributor.author | Liang, D | - |
dc.contributor.author | Lu, T | - |
dc.contributor.author | Luo, P | - |
dc.contributor.author | Shao, L | - |
dc.date.accessioned | 2022-08-19T09:02:30Z | - |
dc.date.available | 2022-08-19T09:02:30Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, October 10-17, 2021. In Proceedings: 2021 IEEE/CVF International Conference on Computer Vision: ICCV 2021, 11-17 October 2021, Virtual event, p. 548-558 | - |
dc.identifier.uri | http://hdl.handle.net/10722/315684 | - |
dc.description.abstract | Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently-proposed Vision Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to the current state of the art. (1) Different from ViT that typically yields low-resolution outputs and incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computations of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research. | - |
dc.language | eng | - |
dc.publisher | IEEE Computer Society. | - |
dc.relation.ispartof | Proceedings: 2021 IEEE/CVF International Conference on Computer Vision: ICCV 2021, 11-17 October 2021, Virtual event | - |
dc.rights | Proceedings: 2021 IEEE/CVF International Conference on Computer Vision: ICCV 2021, 11-17 October 2021, Virtual event. Copyright © IEEE Computer Society. | - |
dc.title | Pyramid vision transformer: A versatile backbone for dense prediction without convolutions | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Luo, P: pluo@hku.hk | - |
dc.identifier.authority | Luo, P=rp02575 | - |
dc.identifier.doi | 10.1109/ICCV48922.2021.00061 | - |
dc.identifier.hkuros | 335601 | - |
dc.identifier.spage | 548 | - |
dc.identifier.epage | 558 | - |
dc.identifier.isi | WOS:000797698900055 | - |
dc.publisher.place | United States | - |