Sign Language Recognition Based on R(2+1)D with Spatial-Temporal-Channel Attention

Han, Xiangzu; Lu, Fei; Yin, Jianqin; Tian, Guohui; Liu, Jun

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1109/THMS.2022.3144000
Scopus: eid_2-s2.0-85124184340
Appears in Collections:
- Industrial & Manufacturing Systems Engineering: Journal/Magazine Articles

Article: Sign Language Recognition Based on R(2+1)D with Spatial-Temporal-Channel Attention

Title	Sign Language Recognition Based on R(2+1)D with Spatial-Temporal-Channel Attention
Authors	Han, Xiangzu; Lu, Fei; Yin, Jianqin; Tian, Guohui; Liu, Jun
Keywords	Attention mechanism; R(2+1)D; sign language recognition (SLR)
Issue Date	2022
Citation	IEEE Transactions on Human-Machine Systems, 2022, v. 52, n. 4, p. 687-698. DOI: http://dx.doi.org/10.1109/THMS.2022.3144000
Abstract	Previous work utilized three-dimensional (3-D) convolutional neural networks (CNNs) to model the spatial appearance and temporal evolution concurrently for sign language recognition (SLR) and exhibited impressive performance. However, there are still challenges for 3-D CNN-based methods. First, motion information plays a more significant role than spatial content in sign language. Therefore, it is still questionable whether to treat space and time equally and model them jointly by heavy 3-D convolutions in a unified approach. Second, because of the interference from the highly redundant information in sign videos, it is still nontrivial to effectively extract discriminative spatiotemporal features related to sign language. In this study, deep R(2+1)D was adopted for separate spatial and temporal modeling, and it was demonstrated that decomposing 3-D convolution filters into independent spatial and temporal convolutions facilitates the optimization process in SLR. A lightweight spatial-temporal-channel attention module, including two submodules called channel-temporal attention and spatial-temporal attention, was proposed to make the network concentrate on the significant information along spatial, temporal, and channel dimensions by combining squeeze-and-excitation attention with self-attention. By embedding this module into R(2+1)D, superior or comparable results to the state-of-the-art methods on the CSL-500, Jester, and EgoGesture datasets were obtained, which demonstrated the effectiveness of the proposed method.
Persistent Identifier	http://hdl.handle.net/10722/349686
ISSN	2168-2291 (2023 Impact Factor: 3.5; 2023 SCImago Journal Rankings: 1.139)
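
The abstract's point about decomposing 3-D convolution filters can be illustrated with a parameter count. The sketch below is not taken from this record. It assumes the factorization described in the original R(2+1)D work (Tran et al., 2018): a t x d x d 3-D convolution is replaced by a 1 x d x d spatial convolution into M intermediate channels, followed by a t x 1 x 1 temporal convolution, with M chosen so that the parameter budget stays close to that of the full 3-D filter bank. The function names and the channel sizes are illustrative assumptions.

```python
def conv3d_params(c_in, c_out, t=3, d=3):
    """Weights in a full t x d x d 3-D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate width M that roughly matches the 3-D parameter budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(c_in, c_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(c_in, c_out, t, d)
    spatial = c_in * m * d * d   # 2-D filters applied frame by frame
    temporal = m * c_out * t     # 1-D filters applied along time
    return spatial + temporal


if __name__ == "__main__":
    # Channel sizes here are hypothetical, chosen only for illustration.
    for c_in, c_out in [(64, 64), (64, 128), (128, 256)]:
        full = conv3d_params(c_in, c_out)
        split = conv2plus1d_params(c_in, c_out)
        m = mid_channels(c_in, c_out)
        print(f"{c_in}->{c_out}: 3-D={full}, (2+1)D={split}, M={m}")
```

Under this assumption the decomposed block never uses more weights than the 3-D one, so any benefit comes from the separate spatial and temporal steps rather than from a larger model.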

DC Field	Value	Language
dc.contributor.author	Han, Xiangzu	-
dc.contributor.author	Lu, Fei	-
dc.contributor.author	Yin, Jianqin	-
dc.contributor.author	Tian, Guohui	-
dc.contributor.author	Liu, Jun	-
dc.date.accessioned	2024-10-17T07:00:08Z	-
dc.date.available	2024-10-17T07:00:08Z	-
dc.date.issued	2022	-
dc.identifier.citation	IEEE Transactions on Human-Machine Systems, 2022, v. 52, n. 4, p. 687-698	-
dc.identifier.issn	2168-2291	-
dc.identifier.uri	http://hdl.handle.net/10722/349686	-
dc.description.abstract	Previous work utilized three-dimensional (3-D) convolutional neural networks (CNNs) to model the spatial appearance and temporal evolution concurrently for sign language recognition (SLR) and exhibited impressive performance. However, there are still challenges for 3-D CNN-based methods. First, motion information plays a more significant role than spatial content in sign language. Therefore, it is still questionable whether to treat space and time equally and model them jointly by heavy 3-D convolutions in a unified approach. Second, because of the interference from the highly redundant information in sign videos, it is still nontrivial to effectively extract discriminative spatiotemporal features related to sign language. In this study, deep R(2+1)D was adopted for separate spatial and temporal modeling, and it was demonstrated that decomposing 3-D convolution filters into independent spatial and temporal convolutions facilitates the optimization process in SLR. A lightweight spatial-temporal-channel attention module, including two submodules called channel-temporal attention and spatial-temporal attention, was proposed to make the network concentrate on the significant information along spatial, temporal, and channel dimensions by combining squeeze-and-excitation attention with self-attention. By embedding this module into R(2+1)D, superior or comparable results to the state-of-the-art methods on the CSL-500, Jester, and EgoGesture datasets were obtained, which demonstrated the effectiveness of the proposed method.	-
dc.language	eng	-
dc.relation.ispartof	IEEE Transactions on Human-Machine Systems	-
dc.subject	Attention mechanism	-
dc.subject	R(2+1)D	-
dc.subject	sign language recognition (SLR)	-
dc.title	Sign Language Recognition Based on R(2+1)D with Spatial-Temporal-Channel Attention	-
dc.type	Article	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1109/THMS.2022.3144000	-
dc.identifier.scopus	eid_2-s2.0-85124184340	-
dc.identifier.volume	52	-
dc.identifier.issue	4	-
dc.identifier.spage	687	-
dc.identifier.epage	698	-
dc.identifier.eissn	2168-2305	-
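
The abstract mentions squeeze-and-excitation attention as one ingredient of the proposed module. The sketch below shows only the generic squeeze-and-excitation channel gate (global average pooling, a bottleneck MLP, and sigmoid gates), written in plain Python. It is not the authors' spatial-temporal-channel module, which also involves self-attention and is not specified in this record. The function names, toy feature values, and hand-picked weights are all hypothetical.

```python
import math


def squeeze(features):
    """Global average pool: one scalar per channel."""
    return [sum(ch) / len(ch) for ch in features]


def dense(vec, weights, bias):
    """Plain fully connected layer; weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def excite(z, w1, b1, w2, b2):
    """Bottleneck MLP: reduce, ReLU, expand, sigmoid -> gates in (0, 1)."""
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_attention(features, w1, b1, w2, b2):
    """Rescale every channel by its learned gate; returns (scaled, gates)."""
    gates = excite(squeeze(features), w1, b1, w2, b2)
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


if __name__ == "__main__":
    # Two channels, each flattened over time and space (toy values).
    feats = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
    # Hand-picked weights: 2 channels -> 1 hidden unit -> 2 gates.
    w1, b1 = [[1.0, -1.0]], [0.0]
    w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
    scaled, gates = se_attention(feats, w1, b1, w2, b2)
    print(gates, scaled)
```

In a trained network the weights would be learned, so informative channels receive gates near 1 and redundant ones are suppressed.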
