Classifying Emotions Across Variations in Speech Data

dc.contributor.advisor Koh, Yun Sing
dc.contributor.advisor Witbrock, Michael
dc.contributor.advisor Watson, Ian
dc.contributor.author Keesing, Aaron
dc.date.accessioned 2023-12-15T00:46:37Z
dc.date.available 2023-12-15T00:46:37Z
dc.date.issued 2023 en
dc.identifier.uri https://hdl.handle.net/2292/66954
dc.description.abstract As speech-enabled technology becomes more prevalent, computers increasingly need to recognise human emotions accurately. Speech emotion recognition (SER) predicts emotions from speech alone, but the current SER literature lacks standardisation in studying emotion recognition across variations in speech, such as language. These incompatible methodologies lead to incomparable results with limited generality. To work towards standardised testing, we propose and develop frameworks for within-domain and cross-domain evaluation. We use these frameworks to study emotion classification across a variety of datasets and to evaluate methods for improving cross-domain robustness based on multi-task learning and synthetic speech. First, we develop a framework for within-domain evaluation that employs multiple datasets to evaluate models and can be applied to any dataset, classifier and feature set. Our framework uses a full factorial design to examine overall trends and interactions between datasets, classifiers and features. We employ it to evaluate several feature sets, particularly pre-trained embeddings, and classifiers for emotion recognition. Our results highlight diminishing returns when training deep neural networks from scratch instead of using pre-trained embeddings as features; such embeddings encode more emotional information than commonly used feature sets. Our cross-domain framework builds on our within-domain framework, allowing us to explore how recognition rates change when training and test data come from different domains. We combine datasets that differ in a single factor, such as language or country of origin, but are otherwise similar, and use these datasets for cross-domain evaluation. We demonstrate our framework with cross-lingual, cross-country and cross-corpus recognition. Our results show that the domains involved have less effect on accuracy than the elicitation method, and that individual emotion accuracies vary more than overall accuracy, highlighting the importance of testing on different combinations of datasets. Finally, we use our cross-domain framework to evaluate techniques for improving model robustness. Our multi-task learning techniques treat tasks either as subgroups of the training data or as auxiliary variables to classify. Our synthetic speech framework adds emotional expression to neutral target speech using emotional voice conversion. Our evaluations show that these techniques can increase cross-domain accuracy, and that they are more effective where the cross-domain gap is larger.
dc.publisher ResearchSpace@Auckland en
dc.relation.ispartof PhD Thesis - University of Auckland en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
dc.rights.uri https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm en
dc.rights.uri http://creativecommons.org/licenses/by/3.0/nz/
dc.title Classifying Emotions Across Variations in Speech Data
dc.type Thesis en
thesis.degree.discipline Computer Science
thesis.degree.grantor The University of Auckland en
thesis.degree.level Doctoral en
thesis.degree.name PhD en
dc.date.updated 2023-12-14T21:35:06Z
dc.rights.holder Copyright: The author en
dc.rights.accessrights http://purl.org/eprint/accessRights/OpenAccess en
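
The cross-domain framework described in the abstract amounts to training a classifier on one corpus and testing on another that differs in a single factor. As an illustration only, and not the thesis's actual pipeline, the following is a minimal scikit-learn sketch of within-corpus versus cross-corpus evaluation; the corpora, feature dimensionality and distribution shift below are synthetic placeholders standing in for pre-computed utterance-level features (such as pre-trained embeddings) from real emotion datasets.

# Minimal sketch of within-corpus vs cross-corpus speech emotion
# recognition evaluation, in the spirit of the framework the abstract
# describes. This is NOT the thesis's pipeline: the corpora, features
# and labels are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fake_corpus(n=200, dim=64, shift=0.0):
    """Placeholder for a real corpus: (feature matrix, labels)."""
    y = rng.integers(len(EMOTIONS), size=n)
    X = rng.normal(shift, 1.0, size=(n, dim)) + y[:, None] * 0.2
    return X, y

# Two hypothetical corpora differing in one factor (e.g. language),
# modelled here as a shift in the feature distribution.
corpora = {"corpus_A": fake_corpus(shift=0.0),
           "corpus_B": fake_corpus(shift=0.5)}

def make_clf():
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Within-corpus baseline: cross-validated accuracy on each corpus alone.
for name, (X, y) in corpora.items():
    acc = cross_val_score(make_clf(), X, y, cv=5).mean()
    print(f"within {name}: {acc:.3f}")

# Cross-corpus evaluation: train on one corpus, test on the other.
for train_name, test_name in [("corpus_A", "corpus_B"),
                              ("corpus_B", "corpus_A")]:
    Xtr, ytr = corpora[train_name]
    Xte, yte = corpora[test_name]
    clf = make_clf().fit(Xtr, ytr)
    print(f"{train_name} -> {test_name}: {clf.score(Xte, yte):.3f}")

A fuller evaluation along the abstract's lines would also report per-emotion performance, since the abstract notes that individual emotion accuracies vary more than overall accuracy; with scikit-learn this could be done by computing per-class recall via sklearn.metrics.recall_score with average=None on the cross-corpus predictions.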