File Download
There are no files associated with this item.
Links for fulltext (may require subscription)
- Publisher Website: https://doi.org/10.1109/TPAMI.2024.3468640
- Scopus: eid_2-s2.0-85204975170

Citations:
- Scopus: 0
Appears in Collections: Article

Language-Aware Vision Transformer for Referring Segmentation
| Title | Language-Aware Vision Transformer for Referring Segmentation |
|---|---|
| Authors | Yang, Zhao; Wang, Jiaqi; Ye, Xubing; Tang, Yansong; Chen, Kai; Zhao, Hengshuang; Torr, Philip H.S. |
| Keywords | Language-Aware Vision Transformer; Multi-Modal Understanding; Referring Segmentation |
| Issue Date | 25-Sep-2024 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 |
| Abstract | Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ('cross-modal') decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results can be harvested with a light-weight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in a parallel fashion, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT as a unified framework capable of handling both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS. |
| Persistent Identifier | http://hdl.handle.net/10722/362069 |
| ISSN | 0162-8828 (print); 1939-3539 (electronic) |
| Journal Metrics | 2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158 |
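The abstract above describes two architectural ideas: early fusion of word features into intermediate stages of a vision Transformer encoder, and a dense attention mechanism that gathers pixel-specific linguistic cues. Below is a minimal PyTorch sketch of that pixel-word attention idea; the layer names, shapes, and gating design are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PixelWordAttention(nn.Module):
    """Minimal sketch of dense pixel-word attention: every pixel of the
    visual feature map queries the word features of the referring
    expression. Dimensions and the gating design are assumptions."""
    def __init__(self, vis_dim, lang_dim):
        super().__init__()
        self.query = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)  # pixel queries
        self.key = nn.Linear(lang_dim, vis_dim)                  # word keys
        self.value = nn.Linear(lang_dim, vis_dim)                # word values
        # a learned gate controlling how much linguistic context is injected
        self.gate = nn.Sequential(nn.Conv2d(vis_dim, vis_dim, 1), nn.Tanh())

    def forward(self, vis, words):
        # vis:   (B, C, H, W) visual features from one encoder stage
        # words: (B, T, D)    word features from the language encoder
        B, C, H, W = vis.shape
        q = self.query(vis).flatten(2).transpose(1, 2)           # (B, HW, C)
        k = self.key(words)                                      # (B, T, C)
        v = self.value(words)                                    # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, T)
        lang_ctx = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        # early fusion: gated residual injection into the visual stream
        return vis + self.gate(lang_ctx) * lang_ctx

# usage with arbitrary example dimensions
vis = torch.randn(2, 96, 32, 32)   # stage visual features (assumed dims)
words = torch.randn(2, 20, 768)    # BERT-style word features
fused = PixelWordAttention(96, 768)(vis, words)  # (2, 96, 32, 32)
```

In the early-fusion scheme the abstract describes, this fused output would feed the next encoder stage, so subsequent self-attention layers already operate on language-aware visual features rather than deferring all cross-modal interaction to a decoder.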
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Yang, Zhao | - |
| dc.contributor.author | Wang, Jiaqi | - |
| dc.contributor.author | Ye, Xubing | - |
| dc.contributor.author | Tang, Yansong | - |
| dc.contributor.author | Chen, Kai | - |
| dc.contributor.author | Zhao, Hengshuang | - |
| dc.contributor.author | Torr, Philip H.S. | - |
| dc.date.accessioned | 2025-09-19T00:31:36Z | - |
| dc.date.available | 2025-09-19T00:31:36Z | - |
| dc.date.issued | 2024-09-25 | - |
| dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 | - |
| dc.identifier.issn | 0162-8828 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362069 | - |
| dc.description.abstract | Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ('cross-modal') decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results can be harvested with a light-weight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in a parallel fashion, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT as a unified framework capable of handling both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | Language-Aware Vision Transformer | - |
| dc.subject | Multi-Modal Understanding | - |
| dc.subject | Referring Segmentation | - |
| dc.title | Language-Aware Vision Transformer for Referring Segmentation | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TPAMI.2024.3468640 | - |
| dc.identifier.scopus | eid_2-s2.0-85204975170 | - |
| dc.identifier.eissn | 1939-3539 | - |
| dc.identifier.issnl | 0162-8828 | - |
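For the video variant, the abstract mentions a 3D version of the dense attention component built from multi-scale convolutional operators arranged in parallel. The sketch below illustrates one plausible form of such a parallel multi-scale 3D convolution block; the kernel sizes and channel handling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleConv3d(nn.Module):
    """Hypothetical sketch of parallel multi-scale 3D convolutions for
    spatio-temporal features; kernel sizes are illustrative assumptions."""
    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # one branch per kernel size, each preserving (T, H, W) via padding
        self.branches = nn.ModuleList(
            nn.Conv3d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Conv3d(dim * len(kernel_sizes), dim, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T, H, W) spatio-temporal features
        # each branch sees a different spatio-temporal receptive field,
        # capturing dependencies at a different granularity level
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))

# usage with arbitrary example dimensions
feats = torch.randn(2, 64, 8, 32, 32)        # (B, C, T, H, W)
out = MultiScaleConv3d(64)(feats)            # (2, 64, 8, 32, 32)
```

Running the branches in parallel and concatenating, rather than stacking them sequentially, is one way to combine fine and coarse spatio-temporal context in a single block, matching the "different granularity levels" phrasing in the abstract.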
