
Article: Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition

Title: Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition
Authors: Sun, Yaohui; Xu, Weiyao; Yu, Xiaoyi; Gao, Ju; Xia, Ting
Keywords: Feature fusion; Human action recognition; Multi-modal; Self-attention
Issue Date: 20-Jul-2023
Publisher: Atlantis Press
Citation: International Journal of Computational Intelligence Systems, 2023, v. 16, n. 1, p. 1-11
Abstract

In this paper, we propose VT-BPAN, a novel approach that combines the capabilities of Vision Transformer (VT), bilinear pooling, and attention network fusion for effective human action recognition (HAR). The proposed methodology significantly enhances the accuracy of activity recognition through the following advancements: (1) The introduction of an effective two-stream feature pooling and fusion mechanism that combines RGB frames and skeleton data to augment the spatial–temporal feature representation. (2) The development of a spatial lightweight vision transformer that mitigates computational costs. The evaluation of this framework encompasses three widely employed video action datasets, demonstrating that the proposed approach achieves performance on par with state-of-the-art methods.
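The two-stream fusion the abstract describes can be illustrated with a minimal bilinear-pooling sketch: the outer product of an RGB-stream descriptor and a skeleton-stream descriptor captures pairwise cross-modal interactions, followed by the signed square-root and l2 normalisation conventionally applied to bilinear features. The feature shapes, names, and normalisation steps below are illustrative assumptions, not the paper's exact VT-BPAN implementation.

```python
import numpy as np

def bilinear_pool(rgb_feat, skel_feat):
    """Fuse two modality descriptors by bilinear pooling.

    Hypothetical sketch: real pipelines pool these descriptors from
    a backbone (here, plain vectors stand in for pooled features).
    """
    # Outer product: every RGB dimension interacts with every
    # skeleton dimension, then flatten to one fused vector.
    fused = np.outer(rgb_feat, skel_feat).ravel()
    # Signed square-root compresses large activations.
    fused = np.sign(fused) * np.sqrt(np.abs(fused))
    # l2 normalisation puts the fused feature on the unit sphere.
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused

rgb = np.random.rand(8)    # stand-in for a pooled RGB-stream feature
skel = np.random.rand(6)   # stand-in for a pooled skeleton-stream feature
v = bilinear_pool(rgb, skel)
print(v.shape)  # (48,)
```

The fused vector grows as the product of the two feature dimensions, which is why compact or attention-weighted variants (as in the paper's lightweight transformer design) are used to keep the cost manageable.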


Persistent Identifier: http://hdl.handle.net/10722/345484
ISSN: 1875-6891
2023 Impact Factor: 2.5
2023 SCImago Journal Rankings: 0.564


DC Field: Value
dc.contributor.author: Sun, Yaohui
dc.contributor.author: Xu, Weiyao
dc.contributor.author: Yu, Xiaoyi
dc.contributor.author: Gao, Ju
dc.contributor.author: Xia, Ting
dc.date.accessioned: 2024-08-27T09:09:02Z
dc.date.available: 2024-08-27T09:09:02Z
dc.date.issued: 2023-07-20
dc.identifier.citation: International Journal of Computational Intelligence Systems, 2023, v. 16, n. 1, p. 1-11
dc.identifier.issn: 1875-6891
dc.identifier.uri: http://hdl.handle.net/10722/345484
dc.description.abstract: In this paper, we propose VT-BPAN, a novel approach that combines the capabilities of Vision Transformer (VT), bilinear pooling, and attention network fusion for effective human action recognition (HAR). The proposed methodology significantly enhances the accuracy of activity recognition through the following advancements: (1) The introduction of an effective two-stream feature pooling and fusion mechanism that combines RGB frames and skeleton data to augment the spatial–temporal feature representation. (2) The development of a spatial lightweight vision transformer that mitigates computational costs. The evaluation of this framework encompasses three widely employed video action datasets, demonstrating that the proposed approach achieves performance on par with state-of-the-art methods.
dc.language: eng
dc.publisher: Atlantis Press
dc.relation.ispartof: International Journal of Computational Intelligence Systems
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: Feature fusion
dc.subject: Human action recognition
dc.subject: Multi-modal
dc.subject: Self-attention
dc.title: Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition
dc.type: Article
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.1007/s44196-023-00292-9
dc.identifier.scopus: eid_2-s2.0-85165391112
dc.identifier.volume: 16
dc.identifier.issue: 1
dc.identifier.spage: 1
dc.identifier.epage: 11
dc.identifier.eissn: 1875-6883
dc.identifier.issnl: 1875-6883
