
Article: Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition

Title: Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition
Authors: Sun, Yaohui; Xu, Weiyao; Yu, Xiaoyi; Gao, Ju; Xia, Ting
Keywords: Feature fusion; Human action recognition; Multi-modal; Self-attention
Issue Date: 20-Jul-2023
Publisher: Atlantis Press
Citation: International Journal of Computational Intelligence Systems, 2023, v. 16, n. 1, p. 1-11
Abstract

In this paper, we propose VT-BPAN, a novel approach that combines the capabilities of Vision Transformer (VT), bilinear pooling, and attention network fusion for effective human action recognition (HAR). The proposed methodology significantly enhances the accuracy of activity recognition through the following advancements: (1) The introduction of an effective two-stream feature pooling and fusion mechanism that combines RGB frames and skeleton data to augment the spatial–temporal feature representation. (2) The development of a spatial lightweight vision transformer that mitigates computational costs. The evaluation of this framework encompasses three widely employed video action datasets, demonstrating that the proposed approach achieves performance on par with state-of-the-art methods.
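The two-stream fusion the abstract describes can be illustrated with a minimal bilinear-pooling sketch: the outer product of an RGB-stream descriptor and a skeleton-stream descriptor captures pairwise cross-modal interactions, followed by the signed square-root and l2 normalisation conventionally applied to bilinear features. The feature shapes, names, and normalisation steps below are illustrative assumptions, not the paper's exact VT-BPAN implementation.

```python
import numpy as np

def bilinear_pool(rgb_feat, skel_feat):
    """Fuse two modality descriptors by bilinear pooling.

    Hypothetical sketch: real pipelines pool these descriptors from
    a backbone (here, plain vectors stand in for pooled features).
    """
    # Outer product: every RGB dimension interacts with every
    # skeleton dimension, then flatten to one fused vector.
    fused = np.outer(rgb_feat, skel_feat).ravel()
    # Signed square-root compresses large activations.
    fused = np.sign(fused) * np.sqrt(np.abs(fused))
    # l2 normalisation puts the fused feature on the unit sphere.
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused

rgb = np.random.rand(8)    # stand-in for a pooled RGB-stream feature
skel = np.random.rand(6)   # stand-in for a pooled skeleton-stream feature
v = bilinear_pool(rgb, skel)
print(v.shape)  # (48,)
```

The fused vector grows as the product of the two feature dimensions, which is why compact or attention-weighted variants (as in the paper's lightweight transformer design) are used to keep the cost manageable.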


Persistent Identifier: http://hdl.handle.net/10722/345484
ISSN: 1875-6891
2023 Impact Factor: 2.5
2023 SCImago Journal Rankings: 0.564


DC Field: Value
dc.contributor.author: Sun, Yaohui
dc.contributor.author: Xu, Weiyao
dc.contributor.author: Yu, Xiaoyi
dc.contributor.author: Gao, Ju
dc.contributor.author: Xia, Ting
dc.date.accessioned: 2024-08-27T09:09:02Z
dc.date.available: 2024-08-27T09:09:02Z
dc.date.issued: 2023-07-20
dc.identifier.citation: International Journal of Computational Intelligence Systems, 2023, v. 16, n. 1, p. 1-11
dc.identifier.issn: 1875-6891
dc.identifier.uri: http://hdl.handle.net/10722/345484
dc.description.abstract: In this paper, we propose VT-BPAN, a novel approach that combines the capabilities of Vision Transformer (VT), bilinear pooling, and attention network fusion for effective human action recognition (HAR). The proposed methodology significantly enhances the accuracy of activity recognition through the following advancements: (1) The introduction of an effective two-stream feature pooling and fusion mechanism that combines RGB frames and skeleton data to augment the spatial–temporal feature representation. (2) The development of a spatial lightweight vision transformer that mitigates computational costs. The evaluation of this framework encompasses three widely employed video action datasets, demonstrating that the proposed approach achieves performance on par with state-of-the-art methods.
dc.language: eng
dc.publisher: Atlantis Press
dc.relation.ispartof: International Journal of Computational Intelligence Systems
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: Feature fusion
dc.subject: Human action recognition
dc.subject: Multi-modal
dc.subject: Self-attention
dc.title: Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition
dc.type: Article
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.1007/s44196-023-00292-9
dc.identifier.scopus: eid_2-s2.0-85165391112
dc.identifier.volume: 16
dc.identifier.issue: 1
dc.identifier.spage: 1
dc.identifier.epage: 11
dc.identifier.eissn: 1875-6883
dc.identifier.issnl: 1875-6883
