Conference Paper: HaViT: Hybrid-Attention Based Vision Transformer for Video Classification

Title: HaViT: Hybrid-Attention Based Vision Transformer for Video Classification
Authors: Li, Li; Zhuang, Liansheng; Gao, Shenghua; Wang, Shafei
Issue Date: 2023
Citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, v. 13844 LNCS, p. 502-517
Abstract: Video transformers have become a promising tool for video classification due to their success in modeling long-range interactions through self-attention. However, existing transformer models exploit only the patch dependencies within a video when computing self-attention, ignoring patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, the paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, the hybrid-attention is computed from internal patch tokens and an external patch token dictionary that encodes patch prior information shared across videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that HaViT achieves state-of-the-art performance on video classification against existing methods. Moreover, experiments show that the proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance.
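The hybrid-attention idea in the abstract can be pictured with a minimal, hypothetical PyTorch sketch: queries come from a clip's internal patch tokens, while keys and values come from those tokens concatenated with a learned external token dictionary shared across videos. The class name, dictionary size, and dimensions below are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch of a hybrid-attention layer (not the authors' code).
    # Queries: a video's internal patch tokens.
    # Keys/values: internal tokens concatenated with a learned external token
    # dictionary shared across all videos.
    import torch
    import torch.nn as nn

    class HybridAttention(nn.Module):
        def __init__(self, dim, num_heads=8, dict_size=256):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # External patch token dictionary, shared across videos (assumed size).
            self.external_tokens = nn.Parameter(torch.randn(dict_size, dim) * 0.02)

        def forward(self, x):  # x: (batch, num_internal_tokens, dim)
            ext = self.external_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
            kv = torch.cat([x, ext], dim=1)  # internal + external tokens
            out, _ = self.attn(query=x, key=kv, value=kv)
            return out

    # Usage: each internal token attends over both internal and external tokens.
    tokens = torch.randn(2, 196, 768)
    layer = HybridAttention(dim=768)
    print(layer(tokens).shape)  # torch.Size([2, 196, 768])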
Persistent Identifier: http://hdl.handle.net/10722/345315
ISSN: 0302-9743
2023 SCImago Journal Rankings: 0.606

 

DC Field: Value
dc.contributor.author: Li, Li
dc.contributor.author: Zhuang, Liansheng
dc.contributor.author: Gao, Shenghua
dc.contributor.author: Wang, Shafei
dc.date.accessioned: 2024-08-15T09:26:34Z
dc.date.available: 2024-08-15T09:26:34Z
dc.date.issued: 2023
dc.identifier.citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2023, v. 13844 LNCS, p. 502-517
dc.identifier.issn: 0302-9743
dc.identifier.uri: http://hdl.handle.net/10722/345315
dc.description.abstract: Video transformers have become a promising tool for video classification due to their success in modeling long-range interactions through self-attention. However, existing transformer models exploit only the patch dependencies within a video when computing self-attention, ignoring patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, the paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, the hybrid-attention is computed from internal patch tokens and an external patch token dictionary that encodes patch prior information shared across videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that HaViT achieves state-of-the-art performance on video classification against existing methods. Moreover, experiments show that the proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance.
dc.language: eng
dc.relation.ispartof: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.title: HaViT: Hybrid-Attention Based Vision Transformer for Video Classification
dc.type: Conference_Paper
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1007/978-3-031-26316-3_30
dc.identifier.scopus: eid_2-s2.0-85151065963
dc.identifier.volume: 13844 LNCS
dc.identifier.spage: 502
dc.identifier.epage: 517
dc.identifier.eissn: 1611-3349
