
Postgraduate thesis: Anger detection in speech data using machine learning methods

Title: Anger detection in speech data using machine learning methods
Authors: Wong, Chi Lok Enoch [黃子諾]
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Wong, C. L. E. [黃子諾]. (2023). Anger detection in speech data using machine learning methods. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Advances in technology have enabled a surge of speech data to be created, stored, and analysed. Speech, being central to human communication, is considered more direct than writing in expressing emotion. Research on emotion in speech, however, is still at a relatively early stage, and the noisy nature of spontaneous speech complicates its analysis. Rooted in the detection of anger in spontaneous speech, this thesis compares different machine learning models to select the best-performing one for anger detection, and examines the acoustic predictors that model relies on most heavily when making decisions. Downton Abbey is chosen as the source of speech for the study, under the assumption that dialogue produced by professional actors is natural, and that the emotions it contains are spontaneous enough to resemble those in real life. A two-level model, comprising a text level and a speech level, is designed to ultimately train an acoustic emotion detector. On the first level, a textual sentiment analyser assigns emotion labels to the subtitles of the Downton Abbey episodes. The second level then takes in the emotion-annotated subtitles and learns to detect emotions in the corresponding speech data. Successful learning depends on (a) correct time alignment between subtitles and the corresponding spoken lines, and (b) reliable emotion labels on the subtitles. The trained model can then be analysed to understand which acoustic features best predict anger. On the first level, an off-the-shelf textual sentiment analyser did not work well: the test data (the subtitles) differed in the language used from the linguistic material in the analyser's training set, and anger had been defined too broadly in that training, beyond the manifested anger detectable in speech. To tackle this issue, 10,000 lines of subtitles were manually annotated to train a dedicated textual sentiment analyser.
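The first level of the design can be illustrated with a minimal sketch: a textual classifier trained on manually annotated subtitle lines, whose predictions then label the remaining subtitles. The toy data and the model choice (a bag-of-words Naive Bayes) are illustrative assumptions, not the thesis's actual classifier or annotation scheme.

```python
import math
from collections import Counter, defaultdict

# Toy annotated subtitle lines standing in for the 10,000 manually
# labelled ones (hypothetical examples, not actual Downton Abbey lines).
annotated = [
    ("how dare you speak to me like that", "anger"),
    ("get out of my sight this instant", "anger"),
    ("would you care for some tea", "neutral"),
    ("dinner will be served at eight", "neutral"),
]

class NaiveBayes:
    """Bag-of-words multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, pairs):
        self.word_counts = defaultdict(Counter)   # per-class word counts
        self.class_counts = Counter()             # per-class line counts
        for text, label in pairs:
            self.class_counts[label] += 1
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            prior = math.log(self.class_counts[label] / sum(self.class_counts.values()))
            # Laplace smoothing keeps unseen words from zeroing a class out.
            return prior + sum(
                math.log((counts[w] + 1) / (total + len(self.vocab)))
                for w in text.split()
            )
        return max(self.class_counts, key=log_score)

analyser = NaiveBayes().fit(annotated)
label = analyser.predict("how dare you")
```

In the two-level setup described above, the analyser's per-line labels, paired with the subtitle timestamps, would supply the (audio segment, emotion label) training examples for the second-level acoustic detector.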
Another major source of complexity comes from the spontaneous nature of the speech, which contains background music, noise, and overlapping voices. Despite the use of different audio preprocessing methods, ranging from applying filters to preprocessing with deep learning models, the quality of the speech remained mediocre for automatic processing. Furthermore, as the machine learning models used in this study do not take raw audio as input, sets of acoustic features are extracted. These features range from simple phonetic descriptors, such as intensity and pitch, to mel-frequency cepstral coefficients (MFCCs), which account for the non-linearity of human hearing. Since these spectral features are highly susceptible to noise, the emotion information in the speech may be obscured, hindering the performance of the acoustic anger detector. A comparative study of two related emotions, fear and disgust, is also conducted to investigate the different acoustic correlates of emotions; the features significant for their detection are found to differ markedly from those for anger. The thesis offers insights into the roles of various acoustic features in the detection of anger. It also suggests that neural networks may require highly sophisticated fine-tuning to be capable of learning from a highly imbalanced dataset such as the one employed in the thesis.
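As a rough illustration of frame-level acoustic feature extraction, the sketch below computes two of the simple phonetic descriptors mentioned above, RMS intensity and a zero-crossing pitch estimate, on a synthetic tone. Real pipelines would use a toolkit such as openSMILE or librosa and would also extract MFCCs; the sample rate, frame size, and test signal here are arbitrary assumptions.

```python
import math

SR = 16_000   # sample rate (Hz); arbitrary illustrative choice
F0 = 200.0    # fundamental frequency of the synthetic test tone (Hz)

# One second of a pure sine wave stands in for a clean speech signal.
signal = [math.sin(2 * math.pi * F0 * n / SR) for n in range(SR)]

def frame_features(frame, sr):
    """Return (RMS intensity, zero-crossing f0 estimate) for one frame."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    # A sinusoid crosses zero twice per period, so f0 ≈ crossings * sr / (2N).
    f0_est = crossings * sr / (2 * len(frame))
    return rms, f0_est

rms, f0_est = frame_features(signal[:1024], SR)
# rms comes out close to 1/sqrt(2) ≈ 0.707, f0_est close to 200 Hz.
```

On real, noisy speech both estimates degrade quickly, which is exactly the susceptibility to noise the abstract describes for the spectral features.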
Degree: Master of Philosophy
Subject: Anger; Audio data mining; Machine learning
Dept/Program: Linguistics
Persistent Identifier: http://hdl.handle.net/10722/335570

 

DC Field: Value
dc.contributor.author: Wong, Chi Lok Enoch
dc.contributor.author: 黃子諾
dc.date.accessioned: 2023-11-30T06:22:41Z
dc.date.available: 2023-11-30T06:22:41Z
dc.date.issued: 2023
dc.identifier.citation: Wong, C. L. E. [黃子諾]. (2023). Anger detection in speech data using machine learning methods. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/335570
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Anger
dc.subject.lcsh: Audio data mining
dc.subject.lcsh: Machine learning
dc.title: Anger detection in speech data using machine learning methods
dc.type: PG_Thesis
dc.description.thesisname: Master of Philosophy
dc.description.thesislevel: Master
dc.description.thesisdiscipline: Linguistics
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2023
dc.identifier.mmsid: 991044745658103414
