Conference Paper: HaViT: Hybrid-Attention Based Vision Transformer for Video Classification
Field | Value |
---|---|
Title | HaViT: Hybrid-Attention Based Vision Transformer for Video Classification |
Authors | Li, Li; Zhuang, Liansheng; Gao, Shenghua; Wang, Shafei |
Issue Date | 2023 |
Citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, v. 13844 LNCS, p. 502-517 |
Abstract | Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models only exploit patch dependencies within a video when computing self-attention, while ignoring patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, hybrid-attention is computed from internal patch tokens and an external patch token dictionary that encodes external patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that our HaViT model achieves state-of-the-art performance on video classification compared with existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance. |
Persistent Identifier | http://hdl.handle.net/10722/345315 |
ISSN | 0302-9743 |
2023 SCImago Journal Rankings | 0.606 |
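
The abstract above describes a hybrid-attention that attends over a video's own patch tokens together with an external patch-token dictionary learned across videos. The following PyTorch sketch is a minimal, hypothetical illustration of that idea; the embedding width, dictionary size, single-head layout, and all class and variable names are assumptions made for illustration, not the released HaViT implementation.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Minimal sketch of hybrid-attention: queries come from a video's internal
    patch tokens, while keys and values are drawn from both those tokens and a
    learned external token dictionary shared across videos. All sizes and names
    are illustrative assumptions, not the paper's implementation."""

    def __init__(self, dim=768, dict_size=256):
        super().__init__()
        # Learned dictionary of external patch tokens (shared across videos).
        self.external_dict = nn.Parameter(torch.randn(dict_size, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, num_patch_tokens, dim)
        b = x.shape[0]
        # Keys/values mix internal tokens with the external dictionary, so each
        # query can attend to within-video and cross-video patch information.
        ext = self.external_dict.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([x, ext], dim=1)  # (batch, num_patch_tokens + dict_size, dim)
        q, k, v = self.to_q(x), self.to_k(kv), self.to_v(kv)
        attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        out = attn.softmax(dim=-1) @ v
        return self.proj(out)

# Toy usage: 2 clips, 8 frames x 196 patches each, embedded to 768 dims.
tokens = torch.randn(2, 8 * 196, 768)
print(HybridAttention()(tokens).shape)  # torch.Size([2, 1568, 768])
```
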
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, Li | - |
dc.contributor.author | Zhuang, Liansheng | - |
dc.contributor.author | Gao, Shenghua | - |
dc.contributor.author | Wang, Shafei | - |
dc.date.accessioned | 2024-08-15T09:26:34Z | - |
dc.date.available | 2024-08-15T09:26:34Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, v. 13844 LNCS, p. 502-517 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | http://hdl.handle.net/10722/345315 | - |
dc.description.abstract | Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models only exploit patch dependencies within a video when computing self-attention, while ignoring patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, hybrid-attention is computed from internal patch tokens and an external patch token dictionary that encodes external patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that our HaViT model achieves state-of-the-art performance on video classification compared with existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance. | -
dc.language | eng | - |
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.title | HaViT: Hybrid-Attention Based Vision Transformer for Video Classification | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/978-3-031-26316-3_30 | - |
dc.identifier.scopus | eid_2-s2.0-85151065963 | - |
dc.identifier.volume | 13844 LNCS | - |
dc.identifier.spage | 502 | - |
dc.identifier.epage | 517 | - |
dc.identifier.eissn | 1611-3349 | - |