- Appears in Collections:
postgraduate thesis: Anger detection in speech data using machine learning methods
Title | Anger detection in speech data using machine learning methods |
---|---|
Authors | Wong, Chi Lok Enoch (黃子諾) |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wong, C. L. E. [黃子諾]. (2023). Anger detection in speech data using machine learning methods. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The advancement of technology has enabled vast amounts of speech data to be created, stored, and analysed. Speech, being central to human communication, is considered more direct than writing in expressing emotions. However, research on emotions in speech is still at a relatively early stage, and the noisy nature of spontaneous speech complicates its analysis. Rooted in the detection of anger in spontaneous speech, this thesis compares different machine learning models to select the best-performing one for anger detection, and examines the acoustic predictors that model relies on most heavily when making decisions.
Downton Abbey is chosen as the source of speech for this study, under the assumption that dialogues produced by professional actors are natural, and that the emotions contained therein are spontaneous enough to resemble those in real life. A two-level model, comprising text and speech levels, is designed to ultimately train an acoustic emotion detector. On the first level, a textual sentiment analyser assigns emotion labels to the subtitles of the Downton Abbey episodes. The second level then takes in the emotion-annotated subtitles and learns to detect emotions in the corresponding speech data. Successful learning depends on (a) correct time alignment between subtitles and the corresponding spoken lines, and (b) reliable emotion labels on the subtitles. The trained model can then be analysed to understand which acoustic features best predict anger.
On the first level, an off-the-shelf textual sentiment analyser did not work optimally, as the test data (the subtitles) differed in language use from the linguistic material in the training set. Moreover, anger was defined too broadly during training, extending beyond the manifested anger detectable in speech. To tackle this issue, 10,000 lines of subtitles were manually annotated to train a dedicated textual sentiment analyser. Another major source of complexity comes from the spontaneous nature of the speech, which contains background music, noise, and overlapping voices. Despite the use of different audio preprocessing methods, ranging from applying filters to preprocessing with deep learning models, the quality of the speech remained mediocre for automatic treatment. Furthermore, as the machine learning models used in this study do not take raw audio as input, sets of acoustic features are extracted. These features range from simple phonetic descriptors, such as intensity and pitch, to mel-frequency cepstral coefficients, which account for the non-linearity of human hearing. Since these spectral features are highly susceptible to noise, the emotion information in the speech may be obscured, which may hinder the performance of the acoustic anger detector.
A comparative study of two related emotions, fear and disgust, is also conducted to investigate the different acoustic correlates of emotions. The features significant for detecting them are found to differ markedly from those for anger.
The thesis offers insights into the roles of various acoustic features in the detection of anger. It also suggests that neural networks may require highly sophisticated fine-tuning to be capable of learning from a highly imbalanced dataset such as the one employed in this thesis. |
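The subtitle–speech alignment the abstract depends on amounts to recovering start and end times for each spoken line. SubRip (.srt) is a common subtitle format, though whether the thesis used it is an assumption; the sketch below only illustrates converting its timestamps to seconds for alignment.

```python
import re

# SubRip timestamps look like "00:01:02,500" (hours:minutes:seconds,milliseconds).
_TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def srt_timestamp_to_seconds(ts):
    """Convert an .srt timestamp string to seconds, e.g. '00:01:02,500' -> 62.5."""
    h, m, s, ms = map(int, _TS.fullmatch(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

print(srt_timestamp_to_seconds("00:01:02,500"))  # 62.5
```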
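The feature families the abstract names (intensity, pitch, MFCCs) can be illustrated with a minimal sketch. The thesis does not specify its extraction toolchain, so the helper names and frame parameters below are assumptions: the mel mapping shown is the standard O'Shaughnessy formula underlying MFCCs, and short-time RMS energy stands in for an intensity descriptor.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel scale: a non-linear frequency warping modelling human hearing."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def frame_rms_intensity(signal, frame_len=1024, hop=512):
    """Short-time RMS energy per frame -- a simple intensity descriptor."""
    n = 1 + max(0, len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Example: one second of a 440 Hz sine at 16 kHz; the mean frame RMS
# of a unit-amplitude sine is close to 1/sqrt(2).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(float(frame_rms_intensity(tone).mean()))
```

Full MFCC extraction additionally applies a mel filterbank to each frame's spectrum and takes a discrete cosine transform of the log energies; libraries such as librosa provide this end to end.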
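The closing remark about learning from a highly imbalanced dataset can be made concrete. The thesis does not state which remedy, if any, it applied; a common one is weighting each class inversely to its frequency in the loss, sketched here on a toy label distribution.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count): rare classes get
    larger weights, so misclassifying them costs more in a weighted loss."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * k) for c, k in counts.items()}

# Toy split resembling a skewed emotion corpus: 90 neutral vs 10 angry lines.
labels = ["neutral"] * 90 + ["angry"] * 10
print(inverse_frequency_weights(labels))
# angry gets weight 5.0, neutral about 0.56
```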
Degree | Master of Philosophy |
Subject | Anger; Audio data mining; Machine learning |
Dept/Program | Linguistics |
Persistent Identifier | http://hdl.handle.net/10722/335570 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wong, Chi Lok Enoch | - |
dc.contributor.author | 黃子諾 | - |
dc.date.accessioned | 2023-11-30T06:22:41Z | - |
dc.date.available | 2023-11-30T06:22:41Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Wong, C. L. E. [黃子諾]. (2023). Anger detection in speech data using machine learning methods. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335570 | - |
dc.description.abstract | The advancement of technology has enabled vast amounts of speech data to be created, stored, and analysed. Speech, being central to human communication, is considered more direct than writing in expressing emotions. However, research on emotions in speech is still at a relatively early stage, and the noisy nature of spontaneous speech complicates its analysis. Rooted in the detection of anger in spontaneous speech, this thesis compares different machine learning models to select the best-performing one for anger detection, and examines the acoustic predictors that model relies on most heavily when making decisions. Downton Abbey is chosen as the source of speech for this study, under the assumption that dialogues produced by professional actors are natural, and that the emotions contained therein are spontaneous enough to resemble those in real life. A two-level model, comprising text and speech levels, is designed to ultimately train an acoustic emotion detector. On the first level, a textual sentiment analyser assigns emotion labels to the subtitles of the Downton Abbey episodes. The second level then takes in the emotion-annotated subtitles and learns to detect emotions in the corresponding speech data. Successful learning depends on (a) correct time alignment between subtitles and the corresponding spoken lines, and (b) reliable emotion labels on the subtitles. The trained model can then be analysed to understand which acoustic features best predict anger. On the first level, an off-the-shelf textual sentiment analyser did not work optimally, as the test data (the subtitles) differed in language use from the linguistic material in the training set. Moreover, anger was defined too broadly during training, extending beyond the manifested anger detectable in speech. To tackle this issue, 10,000 lines of subtitles were manually annotated to train a dedicated textual sentiment analyser. Another major source of complexity comes from the spontaneous nature of the speech, which contains background music, noise, and overlapping voices. Despite the use of different audio preprocessing methods, ranging from applying filters to preprocessing with deep learning models, the quality of the speech remained mediocre for automatic treatment. Furthermore, as the machine learning models used in this study do not take raw audio as input, sets of acoustic features are extracted. These features range from simple phonetic descriptors, such as intensity and pitch, to mel-frequency cepstral coefficients, which account for the non-linearity of human hearing. Since these spectral features are highly susceptible to noise, the emotion information in the speech may be obscured, which may hinder the performance of the acoustic anger detector. A comparative study of two related emotions, fear and disgust, is also conducted to investigate the different acoustic correlates of emotions; the features significant for detecting them are found to differ markedly from those for anger. The thesis offers insights into the roles of various acoustic features in the detection of anger. It also suggests that neural networks may require highly sophisticated fine-tuning to be capable of learning from a highly imbalanced dataset such as the one employed in this thesis. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Anger | - |
dc.subject.lcsh | Audio data mining | - |
dc.subject.lcsh | Machine learning | - |
dc.title | Anger detection in speech data using machine learning methods | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Linguistics | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044745658103414 | - |