- Appears in Collections:
postgraduate thesis: Anger detection in speech data using machine learning methods
Title | Anger detection in speech data using machine learning methods |
---|---|
Authors | Wong, Chi Lok Enoch (黃子諾) |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wong, C. L. E. [黃子諾]. (2023). Anger detection in speech data using machine learning methods. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The advancement of technology has enabled vast amounts of speech data to be created, stored, and analysed. Speech, being central to human communication, is considered more direct than writing in expressing emotions. However, research on emotions in speech is still at a relatively early stage, and the noisy nature of spontaneous speech complicates its analysis. Rooted in the detection of anger in spontaneous speech, this thesis compares different machine learning models to select the best-performing one for anger detection, and examines the acoustic predictors that model relies on most heavily when making decisions.
Downton Abbey is chosen as the source of speech for this study, under the assumption that dialogues produced by professional actors are natural, and that the emotions contained therein are spontaneous enough to resemble those in real life. A two-level model, comprising text and speech levels, is designed to ultimately train an acoustic emotion detector. On the first level, a textual sentiment analyser assigns emotion labels to the subtitles of the Downton Abbey episodes. The second level then takes in the emotion-annotated subtitles and learns to detect emotions in the corresponding speech data. Successful learning depends on (a) correct time alignment between subtitles and the corresponding spoken lines, and (b) reliable emotion labels on the subtitles. The trained model can then be analysed to understand which acoustic features best predict anger.
On the first level, an off-the-shelf textual sentiment analyser did not work optimally, as the test data (the subtitles) differed in language use from the linguistic material in the training set. Moreover, anger was defined too broadly during training, extending beyond the manifested anger detectable in speech. To tackle this issue, 10,000 lines of subtitles were manually annotated to train a dedicated textual sentiment analyser. Another major source of complexity comes from the spontaneous nature of the speech, which contains background music, noise, and overlapping voices. Despite the use of different audio preprocessing methods, ranging from applying filters to preprocessing with deep learning models, the quality of the speech remained mediocre for automatic treatment. Furthermore, as the machine learning models used in this study do not take raw audio as input, sets of acoustic features are extracted. These features range from simple phonetic descriptors, such as intensity and pitch, to mel-frequency cepstral coefficients, which account for the non-linearity of human hearing. Since these spectral features are highly susceptible to noise, the emotion information in the speech may be obscured, which may hinder the performance of the acoustic anger detector.
A comparative study of two related emotions, fear and disgust, is also conducted to investigate the different acoustic correlates of emotions. The features significant for detecting them are found to differ markedly from those for anger.
The thesis offers insights into the roles of various acoustic features in the detection of anger. It also suggests that neural networks may require highly sophisticated fine-tuning to be capable of learning from a highly imbalanced dataset such as the one employed in this thesis. |
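The subtitle–speech alignment the abstract depends on amounts to recovering start and end times for each spoken line. SubRip (.srt) is a common subtitle format, though whether the thesis used it is an assumption; the sketch below only illustrates converting its timestamps to seconds for alignment.

```python
import re

# SubRip timestamps look like "00:01:02,500" (hours:minutes:seconds,milliseconds).
_TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def srt_timestamp_to_seconds(ts):
    """Convert an .srt timestamp string to seconds, e.g. '00:01:02,500' -> 62.5."""
    h, m, s, ms = map(int, _TS.fullmatch(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

print(srt_timestamp_to_seconds("00:01:02,500"))  # 62.5
```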
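The feature families the abstract names (intensity, pitch, MFCCs) can be illustrated with a minimal sketch. The thesis does not specify its extraction toolchain, so the helper names and frame parameters below are assumptions: the mel mapping shown is the standard O'Shaughnessy formula underlying MFCCs, and short-time RMS energy stands in for an intensity descriptor.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel scale: a non-linear frequency warping modelling human hearing."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def frame_rms_intensity(signal, frame_len=1024, hop=512):
    """Short-time RMS energy per frame -- a simple intensity descriptor."""
    n = 1 + max(0, len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Example: one second of a 440 Hz sine at 16 kHz; the mean frame RMS
# of a unit-amplitude sine is close to 1/sqrt(2).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(float(frame_rms_intensity(tone).mean()))
```

Full MFCC extraction additionally applies a mel filterbank to each frame's spectrum and takes a discrete cosine transform of the log energies; libraries such as librosa provide this end to end.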
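The closing remark about learning from a highly imbalanced dataset can be made concrete. The thesis does not state which remedy, if any, it applied; a common one is weighting each class inversely to its frequency in the loss, sketched here on a toy label distribution.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count): rare classes get
    larger weights, so misclassifying them costs more in a weighted loss."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * k) for c, k in counts.items()}

# Toy split resembling a skewed emotion corpus: 90 neutral vs 10 angry lines.
labels = ["neutral"] * 90 + ["angry"] * 10
print(inverse_frequency_weights(labels))
# angry gets weight 5.0, neutral about 0.56
```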
Degree | Master of Philosophy |
Subject | Anger; Audio data mining; Machine learning |
Dept/Program | Linguistics |
Persistent Identifier | http://hdl.handle.net/10722/335570 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wong, Chi Lok Enoch | - |
dc.contributor.author | 黃子諾 | - |
dc.date.accessioned | 2023-11-30T06:22:41Z | - |
dc.date.available | 2023-11-30T06:22:41Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Wong, C. L. E. [黃子諾]. (2023). Anger detection in speech data using machine learning methods. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335570 | - |
dc.description.abstract | The advancement of technology has enabled vast amounts of speech data to be created, stored, and analysed. Speech, being central to human communication, is considered more direct than writing in expressing emotions. However, research on emotions in speech is still at a relatively early stage, and the noisy nature of spontaneous speech complicates its analysis. Rooted in the detection of anger in spontaneous speech, this thesis compares different machine learning models to select the best-performing one for anger detection, and examines the acoustic predictors that model relies on most heavily when making decisions. Downton Abbey is chosen as the source of speech for this study, under the assumption that dialogues produced by professional actors are natural, and that the emotions contained therein are spontaneous enough to resemble those in real life. A two-level model, comprising text and speech levels, is designed to ultimately train an acoustic emotion detector. On the first level, a textual sentiment analyser assigns emotion labels to the subtitles of the Downton Abbey episodes. The second level then takes in the emotion-annotated subtitles and learns to detect emotions in the corresponding speech data. Successful learning depends on (a) correct time alignment between subtitles and the corresponding spoken lines, and (b) reliable emotion labels on the subtitles. The trained model can then be analysed to understand which acoustic features best predict anger. On the first level, an off-the-shelf textual sentiment analyser did not work optimally, as the test data (the subtitles) differed in language use from the linguistic material in the training set. Moreover, anger was defined too broadly during training, extending beyond the manifested anger detectable in speech. To tackle this issue, 10,000 lines of subtitles were manually annotated to train a dedicated textual sentiment analyser. Another major source of complexity comes from the spontaneous nature of the speech, which contains background music, noise, and overlapping voices. Despite the use of different audio preprocessing methods, ranging from applying filters to preprocessing with deep learning models, the quality of the speech remained mediocre for automatic treatment. Furthermore, as the machine learning models used in this study do not take raw audio as input, sets of acoustic features are extracted. These features range from simple phonetic descriptors, such as intensity and pitch, to mel-frequency cepstral coefficients, which account for the non-linearity of human hearing. Since these spectral features are highly susceptible to noise, the emotion information in the speech may be obscured, which may hinder the performance of the acoustic anger detector. A comparative study of two related emotions, fear and disgust, is also conducted to investigate the different acoustic correlates of emotions; the features significant for detecting them are found to differ markedly from those for anger. The thesis offers insights into the roles of various acoustic features in the detection of anger. It also suggests that neural networks may require highly sophisticated fine-tuning to be capable of learning from a highly imbalanced dataset such as the one employed in this thesis. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Anger | - |
dc.subject.lcsh | Audio data mining | - |
dc.subject.lcsh | Machine learning | - |
dc.title | Anger detection in speech data using machine learning methods | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Linguistics | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044745658103414 | - |