Conference Paper: HaViT: Hybrid-Attention Based Vision Transformer for Video Classification
Field | Value |
---|---|
Title | HaViT: Hybrid-Attention Based Vision Transformer for Video Classification |
Authors | Li, Li; Zhuang, Liansheng; Gao, Shenghua; Wang, Shafei |
Issue Date | 2023 |
Citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, v. 13844 LNCS, p. 502-517 |
Abstract | Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models only exploit patch dependencies within a video when computing self-attention, while ignoring patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, hybrid-attention is computed from internal patch tokens and an external patch token dictionary that encodes external patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that our HaViT model achieves state-of-the-art performance on video classification compared with existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance. |
Persistent Identifier | http://hdl.handle.net/10722/345315 |
ISSN | 0302-9743 |
2023 SCImago Journal Rankings | 0.606 |
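
The abstract above describes a hybrid-attention that attends over a video's own patch tokens together with an external patch-token dictionary learned across videos. The following PyTorch sketch is a minimal, hypothetical illustration of that idea; the embedding width, dictionary size, single-head layout, and all class and variable names are assumptions made for illustration, not the released HaViT implementation.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Minimal sketch of hybrid-attention: queries come from a video's internal
    patch tokens, while keys and values are drawn from both those tokens and a
    learned external token dictionary shared across videos. All sizes and names
    are illustrative assumptions, not the paper's implementation."""

    def __init__(self, dim=768, dict_size=256):
        super().__init__()
        # Learned dictionary of external patch tokens (shared across videos).
        self.external_dict = nn.Parameter(torch.randn(dict_size, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, num_patch_tokens, dim)
        b = x.shape[0]
        # Keys/values mix internal tokens with the external dictionary, so each
        # query can attend to within-video and cross-video patch information.
        ext = self.external_dict.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([x, ext], dim=1)  # (batch, num_patch_tokens + dict_size, dim)
        q, k, v = self.to_q(x), self.to_k(kv), self.to_v(kv)
        attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        out = attn.softmax(dim=-1) @ v
        return self.proj(out)

# Toy usage: 2 clips, 8 frames x 196 patches each, embedded to 768 dims.
tokens = torch.randn(2, 8 * 196, 768)
print(HybridAttention()(tokens).shape)  # torch.Size([2, 1568, 768])
```
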
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, Li | - |
dc.contributor.author | Zhuang, Liansheng | - |
dc.contributor.author | Gao, Shenghua | - |
dc.contributor.author | Wang, Shafei | - |
dc.date.accessioned | 2024-08-15T09:26:34Z | - |
dc.date.available | 2024-08-15T09:26:34Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, v. 13844 LNCS, p. 502-517 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | http://hdl.handle.net/10722/345315 | - |
dc.description.abstract | Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models only exploit patch dependencies within a video when computing self-attention, while ignoring patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, hybrid-attention is computed from internal patch tokens and an external patch token dictionary that encodes external patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that our HaViT model achieves state-of-the-art performance on video classification compared with existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance. | -
dc.language | eng | - |
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.title | HaViT: Hybrid-Attention Based Vision Transformer for Video Classification | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/978-3-031-26316-3_30 | - |
dc.identifier.scopus | eid_2-s2.0-85151065963 | - |
dc.identifier.volume | 13844 LNCS | - |
dc.identifier.spage | 502 | - |
dc.identifier.epage | 517 | - |
dc.identifier.eissn | 1611-3349 | - |