Article: SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation
| Title | SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation |
|---|---|
| Authors | Cao, Hu; Chen, Guang; Zhao, Hengshuang; Jiang, Dongsheng; Zhang, Xiaopeng; Tian, Qi; Knoll, Alois |
| Keywords | dimension-pooling attention; image segmentation; scene understanding; semantic-balanced decoder; vision transformer |
| Issue Date | 3-Jul-2024 |
| Publisher | IEEE |
| Citation | IEEE Transactions on Intelligent Transportation Systems, 2024, v. 25, n. 11, p. 15934-15946 |
| Abstract | Image segmentation plays a critical role in autonomous driving by providing vehicles with a detailed and accurate understanding of their surroundings. Transformers have recently shown encouraging results in image segmentation. However, it is challenging for transformer-based models to strike a balance between performance and efficiency: their computational complexity is quadratic in the number of input tokens, which severely hinders their application in dense prediction tasks. In this paper, we present the semantic-aware dimension-pooling transformer (SDPT) to mitigate the conflict between accuracy and efficiency. The proposed model comprises an efficient transformer encoder for generating hierarchical features and a semantic-balanced decoder for predicting semantic masks. In the encoder, a dimension-pooling mechanism is used in the multi-head self-attention (MHSA) to reduce the computational cost, and a parallel depth-wise convolution is used to capture local semantics. We further apply this dimension-pooling attention (DPA) in the decoder as a refinement module to integrate multi-level features. With such a simple yet powerful encoder-decoder framework, we empirically demonstrate that the proposed SDPT achieves excellent performance and efficiency on various popular benchmarks, including ADE20K, Cityscapes, and COCO-Stuff. For example, our SDPT achieves 48.6% mIoU on ADE20K, outperforming current methods at lower computational cost. The code can be found at https://github.com/HuCaoFighting/SDPT. |
| Persistent Identifier | http://hdl.handle.net/10722/362075 |
| ISSN | 1524-9050 |
| Journal Metrics | 2023 Impact Factor: 7.9; 2023 SCImago Journal Rankings: 2.580 |
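The core idea in the abstract, pooling inside the attention computation so its cost is no longer quadratic in the token count, can be illustrated with a short sketch. This record does not specify the paper's exact dimension-pooling operator, so the following is a hypothetical minimal version assuming average pooling of keys and values along the token dimension; the names `pooled_attention` and `pool` are illustrative, not from the SDPT codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(q, k, v, pool=4):
    """Single-head attention with keys/values average-pooled along the
    token dimension. q, k, v: (N, d) arrays. Pooling N keys down to
    M = N // pool reduces the score matrix from (N, N) to (N, M),
    i.e. O(N^2 * d) work becomes O(N * M * d)."""
    n, d = k.shape
    m = n // pool
    k_p = k[: m * pool].reshape(m, pool, d).mean(axis=1)  # (M, d)
    v_p = v[: m * pool].reshape(m, pool, d).mean(axis=1)  # (M, d)
    scores = q @ k_p.T / np.sqrt(d)                       # (N, M)
    return softmax(scores, axis=-1) @ v_p                 # (N, d)

rng = np.random.default_rng(0)
n, d = 16, 8
q = rng.standard_normal((n, d))
out = pooled_attention(q, q, q, pool=4)
print(out.shape)  # (16, 8)
```

With `pool=1` this reduces to ordinary single-head self-attention; larger `pool` values trade attention resolution for a proportional cut in compute, which is the accuracy/efficiency balance the abstract describes.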
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Cao, Hu | - |
| dc.contributor.author | Chen, Guang | - |
| dc.contributor.author | Zhao, Hengshuang | - |
| dc.contributor.author | Jiang, Dongsheng | - |
| dc.contributor.author | Zhang, Xiaopeng | - |
| dc.contributor.author | Tian, Qi | - |
| dc.contributor.author | Knoll, Alois | - |
| dc.date.accessioned | 2025-09-19T00:31:39Z | - |
| dc.date.available | 2025-09-19T00:31:39Z | - |
| dc.date.issued | 2024-07-03 | - |
| dc.identifier.citation | IEEE Transactions on Intelligent Transportation Systems, 2024, v. 25, n. 11, p. 15934-15946 | - |
| dc.identifier.issn | 1524-9050 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362075 | - |
| dc.description.abstract | Image segmentation plays a critical role in autonomous driving by providing vehicles with a detailed and accurate understanding of their surroundings. Transformers have recently shown encouraging results in image segmentation. However, it is challenging for transformer-based models to strike a balance between performance and efficiency: their computational complexity is quadratic in the number of input tokens, which severely hinders their application in dense prediction tasks. In this paper, we present the semantic-aware dimension-pooling transformer (SDPT) to mitigate the conflict between accuracy and efficiency. The proposed model comprises an efficient transformer encoder for generating hierarchical features and a semantic-balanced decoder for predicting semantic masks. In the encoder, a dimension-pooling mechanism is used in the multi-head self-attention (MHSA) to reduce the computational cost, and a parallel depth-wise convolution is used to capture local semantics. We further apply this dimension-pooling attention (DPA) in the decoder as a refinement module to integrate multi-level features. With such a simple yet powerful encoder-decoder framework, we empirically demonstrate that the proposed SDPT achieves excellent performance and efficiency on various popular benchmarks, including ADE20K, Cityscapes, and COCO-Stuff. For example, our SDPT achieves 48.6% mIoU on ADE20K, outperforming current methods at lower computational cost. The code can be found at https://github.com/HuCaoFighting/SDPT. | - |
| dc.language | eng | - |
| dc.publisher | IEEE | - |
| dc.relation.ispartof | IEEE Transactions on Intelligent Transportation Systems | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | dimension-pooling attention | - |
| dc.subject | Image segmentation | - |
| dc.subject | scene understanding | - |
| dc.subject | semantic-balanced decoder | - |
| dc.subject | vision transformer | - |
| dc.title | SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TITS.2024.3417813 | - |
| dc.identifier.scopus | eid_2-s2.0-85203146661 | - |
| dc.identifier.volume | 25 | - |
| dc.identifier.issue | 11 | - |
| dc.identifier.spage | 15934 | - |
| dc.identifier.epage | 15946 | - |
| dc.identifier.eissn | 1558-0016 | - |
| dc.identifier.issnl | 1524-9050 | - |
