Appears in Collections: postgraduate thesis: Deep learning frameworks for sign language recognition
Title | Deep learning frameworks for sign language recognition |
---|---|
Authors | Zhou, Zhenxing (周振星) |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zhou, Z. [周振星]. (2023). Deep learning frameworks for sign language recognition. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Sign language recognition (SLR) is an important yet challenging research area for breaking down the communication barriers between the hearing and the deaf. Existing solutions are still insufficient due to the high complexity of SLR, which involves signal processing, gesture recognition and language modelling. Therefore, this thesis aims to develop effective deep learning frameworks for different directions in SLR, including isolated SLR (ISLR), continuous vision-based SLR (CVSLR), continuous gesture-based SLR (CGSLR) and continuous multimodal SLR (CMSLR).
First, ISLR focuses on recognizing each sign language gloss independently. Existing methods typically rely on conventional 2D convolutional neural networks for ISLR while ignoring the important step of frame selection, which severely hinders them from improving recognition accuracy. Accordingly, we propose a new (3+2+1)D ResNet model which integrates (2+1)D convolution and 3D convolution to effectively analyze spatial and temporal information, while adopting an original frame selection mechanism to denoise the raw data. The experimental results on an ISLR dataset show that this model attains better performance than other existing methods.
Second, CVSLR deciphers sequences of sign language glosses from the visual information collected by RGB cameras. Existing approaches generally model the underlying sign languages with long short-term memory (LSTM) networks, and none of them considers adopting the more powerful bidirectional encoder representations from transformers (BERT) for CVSLR. Moreover, the performance of these approaches can be easily affected by non-standard glosses in CVSLR. Therefore, we introduce a pioneering SignBERT framework which utilizes an updated BERT model for CVSLR. Furthermore, two masked training methods are introduced to enhance the robustness of this framework against incorrect glosses. The experimental results on four challenging datasets reveal that SignBERT achieves lower word error rates (WERs) than other approaches in CVSLR.
Third, CGSLR targets recognizing sign language sentences through smart watches. The biggest challenge in CGSLR is the insufficient amount of data provided by the smart watches. Therefore, we introduce a bidirectional LSTM-based multi-feature framework to address this challenge by extracting multiple sets of input representations from the smart watch data and using a bidirectional LSTM for language modelling. The experimental results indicate that this framework outperforms other approaches in CGSLR.
Lastly, CMSLR is a promising direction which utilizes multiple input modalities, such as RGB videos and smart watch data, for SLR. Existing approaches in CMSLR typically integrate the information acquired from diverse modalities through immutable weights and provide no opportunity for knowledge exchange among these modalities, so the advantages of multiple input modalities are underutilized. To tackle this issue, we introduce an original deep learning framework named CA-SignBERT. In this framework, an innovative cross-attention mechanism is proposed to exchange information among different modalities, while a special weight control module is introduced to adaptively hybridize the outputs from these modalities. The experimental results demonstrate that this framework attains significantly lower WERs on four benchmark datasets when compared with the state-of-the-art approaches. |
Degree | Doctor of Philosophy |
Subject | Sign language - Data processing; Deep learning (Machine learning) |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/325725 |
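
The abstract above names four architectures only at a high level. As a reading aid, the sketches below illustrate the general techniques it mentions; they are minimal PyTorch illustrations, not the thesis code, and all layer sizes, names and hyperparameters are assumptions chosen for readability. The first sketch shows how a (2+1)D convolution factorizes a 3D convolution into a spatial step followed by a temporal step, and how such a block could be mixed with a plain 3D convolution, which is the general idea behind a (3+2+1)D backbone.

```python
# Illustrative sketch (not the thesis code): a (2+1)D convolution factorizes a 3D
# convolution into a spatial 2D step followed by a temporal 1D step; a toy backbone
# then mixes one full 3D convolution with one (2+1)D block. Layer sizes are assumed.
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorized spatio-temporal convolution: 1x3x3 spatial, then 3x1x1 temporal."""
    def __init__(self, in_ch, out_ch, mid_ch=64):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

class MixedBackbone(nn.Module):
    """Toy backbone mixing a full 3D convolution with a (2+1)D block."""
    def __init__(self, num_glosses=100):
        super().__init__()
        self.conv3d = nn.Conv3d(3, 32, kernel_size=3, padding=1)  # joint spatio-temporal
        self.conv2p1d = Conv2Plus1D(32, 64)                       # factorized variant
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64, num_glosses)

    def forward(self, clip):
        h = torch.relu(self.conv3d(clip))
        h = self.conv2p1d(h)
        return self.fc(self.pool(h).flatten(1))

logits = MixedBackbone()(torch.randn(2, 3, 16, 112, 112))  # 2 clips of 16 frames
print(logits.shape)  # torch.Size([2, 100])
```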
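The SignBERT paragraph mentions masked training to make the gloss model robust against incorrect glosses. Below is a minimal sketch of BERT-style gloss masking, assuming a conventional 15% corruption rate and a reserved mask id; both are standard BERT choices, not values taken from the thesis.

```python
# Illustrative sketch (assumed details, not the SignBERT implementation): randomly
# corrupt gloss id sequences during training so the sequence model learns to recover
# the original glosses at corrupted positions.
import torch

MASK_ID = 0          # hypothetical id reserved for a [MASK] gloss
VOCAB_SIZE = 1000    # hypothetical gloss vocabulary size

def mask_glosses(gloss_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (inputs, targets) for masked gloss training.

    Targets are -100 (ignored by nn.CrossEntropyLoss) everywhere except at the
    corrupted positions, where the model must predict the original gloss.
    """
    inputs = gloss_ids.clone()
    targets = torch.full_like(gloss_ids, -100)
    corrupt = torch.rand(gloss_ids.shape) < mask_prob
    targets[corrupt] = gloss_ids[corrupt]
    # Half of the corrupted positions become [MASK], half become a random wrong gloss.
    use_mask = corrupt & (torch.rand(gloss_ids.shape) < 0.5)
    inputs[use_mask] = MASK_ID
    use_random = corrupt & ~use_mask
    inputs[use_random] = torch.randint(1, VOCAB_SIZE, gloss_ids.shape)[use_random]
    return inputs, targets

inp, tgt = mask_glosses(torch.randint(1, VOCAB_SIZE, (4, 20)))  # 4 sequences of 20 glosses
```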
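For the smart-watch (CGSLR) framework, the abstract describes extracting multiple sets of input representations and feeding them to a bidirectional LSTM. The toy sketch below assumes two simple feature views, the raw inertial signal and its first-order difference, purely as placeholders for whatever representations the thesis actually uses.

```python
# Illustrative sketch (assumptions throughout): two simple feature views of smart-watch
# inertial data are concatenated and fed to a bidirectional LSTM that scores glosses
# per time step.
import torch
import torch.nn as nn

class MultiFeatureBiLSTM(nn.Module):
    def __init__(self, channels=6, hidden=128, num_glosses=100):
        super().__init__()
        # LSTM input is the raw channels plus their temporal difference.
        self.lstm = nn.LSTM(input_size=2 * channels, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_glosses)

    def forward(self, imu):  # imu: (batch, time, channels)
        diff = torch.diff(imu, dim=1, prepend=imu[:, :1])  # first-order difference view
        features = torch.cat([imu, diff], dim=-1)
        hidden, _ = self.lstm(features)
        return self.classifier(hidden)  # per-timestep gloss scores

scores = MultiFeatureBiLSTM()(torch.randn(2, 200, 6))  # 2 recordings, 200 samples, 6 IMU axes
print(scores.shape)  # torch.Size([2, 200, 100])
```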
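Finally, the CA-SignBERT paragraph names a cross-attention mechanism for exchanging information between modalities and a weight control module for adaptively hybridizing their outputs. The sketch below shows one plausible fusion layer of this kind, with all dimensions and the gating design assumed for illustration rather than taken from the thesis.

```python
# Illustrative sketch (assumed architecture, not the actual CA-SignBERT code): two
# modality streams exchange information through cross-attention, and a small gating
# module produces adaptive per-position weights for combining their outputs.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.video_to_watch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.watch_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Weight control: predicts one mixing weight per position from both streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, video, watch):  # both: (batch, time, dim)
        # Each stream attends to the other (queries from itself, keys/values from the
        # other modality), so information is exchanged in both directions.
        v, _ = self.video_to_watch(video, watch, watch)
        w, _ = self.watch_to_video(watch, video, video)
        alpha = self.gate(torch.cat([v, w], dim=-1))  # adaptive per-position weight
        return alpha * v + (1 - alpha) * w            # hybridized representation

fused = CrossModalFusion()(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(fused.shape)  # torch.Size([2, 50, 256])
```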
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhou, Zhenxing | - |
dc.contributor.author | 周振星 | - |
dc.date.accessioned | 2023-03-02T16:32:20Z | - |
dc.date.available | 2023-03-02T16:32:20Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Zhou, Z. [周振星]. (2023). Deep learning frameworks for sign language recognition. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/325725 | - |
dc.description.abstract | Sign language recognition (SLR) is an important yet challenging research area for breaking down the communication barriers between the hearing and the deaf. Existing solutions are still insufficient due to the high complexity of SLR, which involves signal processing, gesture recognition and language modelling. Therefore, this thesis aims to develop effective deep learning frameworks for different directions in SLR, including isolated SLR (ISLR), continuous vision-based SLR (CVSLR), continuous gesture-based SLR (CGSLR) and continuous multimodal SLR (CMSLR). First, ISLR focuses on recognizing each sign language gloss independently. Existing methods typically rely on conventional 2D convolutional neural networks for ISLR while ignoring the important step of frame selection, which severely hinders them from improving recognition accuracy. Accordingly, we propose a new (3+2+1)D ResNet model which integrates (2+1)D convolution and 3D convolution to effectively analyze spatial and temporal information, while adopting an original frame selection mechanism to denoise the raw data. The experimental results on an ISLR dataset show that this model attains better performance than other existing methods. Second, CVSLR deciphers sequences of sign language glosses from the visual information collected by RGB cameras. Existing approaches generally model the underlying sign languages with long short-term memory (LSTM) networks, and none of them considers adopting the more powerful bidirectional encoder representations from transformers (BERT) for CVSLR. Moreover, the performance of these approaches can be easily affected by non-standard glosses in CVSLR. Therefore, we introduce a pioneering SignBERT framework which utilizes an updated BERT model for CVSLR. Furthermore, two masked training methods are introduced to enhance the robustness of this framework against incorrect glosses. The experimental results on four challenging datasets reveal that SignBERT achieves lower word error rates (WERs) than other approaches in CVSLR. Third, CGSLR targets recognizing sign language sentences through smart watches. The biggest challenge in CGSLR is the insufficient amount of data provided by the smart watches. Therefore, we introduce a bidirectional LSTM-based multi-feature framework to address this challenge by extracting multiple sets of input representations from the smart watch data and using a bidirectional LSTM for language modelling. The experimental results indicate that this framework outperforms other approaches in CGSLR. Lastly, CMSLR is a promising direction which utilizes multiple input modalities, such as RGB videos and smart watch data, for SLR. Existing approaches in CMSLR typically integrate the information acquired from diverse modalities through immutable weights and provide no opportunity for knowledge exchange among these modalities, so the advantages of multiple input modalities are underutilized. To tackle this issue, we introduce an original deep learning framework named CA-SignBERT. In this framework, an innovative cross-attention mechanism is proposed to exchange information among different modalities, while a special weight control module is introduced to adaptively hybridize the outputs from these modalities. The experimental results demonstrate that this framework attains significantly lower WERs on four benchmark datasets when compared with the state-of-the-art approaches. | -
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Sign language - Data processing | - |
dc.subject.lcsh | Deep learning (Machine learning) | - |
dc.title | Deep learning frameworks for sign language recognition | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044649899303414 | - |