Article: VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition

Title: VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition
Authors: Sun, Yaohui; Xu, Weiyao; Yu, Xiaoyi; Gao, Ju
Keywords: Feature fusion; Microsoft Kinect camera; Multi-head pooling attention; Self-attention; Vision transformer
Issue Date: 11-Dec-2023
Publisher: Springer
Citation: Multimedia Tools and Applications, 2023
Abstract

The latest generation of the Microsoft Kinect camera captures a series of multimodal signals, providing RGB video, depth sequences, and skeleton information, which makes it possible to improve human action recognition performance by fusing different data modalities. However, most existing fusion methods simply combine different features, ignoring the underlying semantics shared between modalities, which limits accuracy. In addition, the data contain a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves recognition accuracy in two ways: 1) an effective two-stream feature pooling and fusion mechanism is proposed, in which RGB frames and skeleton data are fused to enhance the spatio-temporal feature representation; 2) a spatially lightweight multiscale vision Transformer is proposed, which reduces the cost of computation. The framework is evaluated on three widely used video action datasets, and the proposed approach achieves performance comparable to state-of-the-art methods.
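The two-stream bilinear pooling fusion described in the abstract can be sketched roughly as follows. This is a generic illustration of outer-product (bilinear) pooling between two modality feature vectors, not the paper's exact VT-BPAN formulation; the feature dimensions and the signed-sqrt/L2 post-processing are common conventions assumed for illustration.

```python
import numpy as np

def bilinear_pool(rgb_feat, skel_feat, eps=1e-12):
    """Fuse two modality feature vectors via bilinear (outer-product) pooling."""
    # The outer product captures every pairwise interaction between an
    # RGB feature dimension and a skeleton feature dimension.
    fused = np.outer(rgb_feat, skel_feat).ravel()
    # Signed square-root and L2 normalisation are standard post-processing
    # steps for bilinear features.
    fused = np.sign(fused) * np.sqrt(np.abs(fused))
    return fused / (np.linalg.norm(fused) + eps)

# Hypothetical per-frame features: 512-d from an RGB backbone and
# 256-d from a skeleton encoder (dimensions chosen for illustration).
rgb = np.random.rand(512)
skel = np.random.rand(256)
fused = bilinear_pool(rgb, skel)
print(fused.shape)  # (131072,)
```

The fused vector grows as the product of the two input dimensions, which is why compact pooling or attention-based variants are typically used to keep the representation tractable.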


Persistent Identifier: http://hdl.handle.net/10722/345581
ISSN: 1380-7501
2023 Impact Factor: 3.0
2023 SCImago Journal Rankings: 0.801

 

DC Field / Value
dc.contributor.author: Sun, Yaohui
dc.contributor.author: Xu, Weiyao
dc.contributor.author: Yu, Xiaoyi
dc.contributor.author: Gao, Ju
dc.date.accessioned: 2024-08-27T09:09:48Z
dc.date.available: 2024-08-27T09:09:48Z
dc.date.issued: 2023-12-11
dc.identifier.citation: Multimedia Tools and Applications, 2023
dc.identifier.issn: 1380-7501
dc.identifier.uri: http://hdl.handle.net/10722/345581
dc.description.abstract: <p>The latest generation of the Microsoft Kinect camera captures a series of multimodal signals, providing RGB video, depth sequences, and skeleton information, which makes it possible to improve human action recognition performance by fusing different data modalities. However, most existing fusion methods simply combine different features, ignoring the underlying semantics shared between modalities, which limits accuracy. In addition, the data contain a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves recognition accuracy in two ways: 1) an effective two-stream feature pooling and fusion mechanism is proposed, in which RGB frames and skeleton data are fused to enhance the spatio-temporal feature representation; 2) a spatially lightweight multiscale vision Transformer is proposed, which reduces the cost of computation. The framework is evaluated on three widely used video action datasets, and the proposed approach achieves performance comparable to state-of-the-art methods.</p>
dc.language: eng
dc.publisher: Springer
dc.relation.ispartof: Multimedia Tools and Applications
dc.subject: Feature fusion
dc.subject: Microsoft kinect camera
dc.subject: Multi-head pooling attention
dc.subject: Self-attention
dc.subject: Vision transformer
dc.title: VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition
dc.type: Article
dc.identifier.doi: 10.1007/s11042-023-17788-3
dc.identifier.scopus: eid_2-s2.0-85179301800
dc.identifier.eissn: 1573-7721
dc.identifier.issnl: 1380-7501
