Abstract:
As speech-enabled technology becomes more prevalent, computers need to predict human
emotions accurately. Speech emotion recognition (SER) predicts emotions using just
speech, but current SER literature lacks standardisation in studying emotion recognition
across variations in speech, such as language. The resulting incompatible methodologies lead
to incomparable results with limited generality. To work towards standardised testing,
we propose and develop frameworks for within-domain and cross-domain evaluation. We
use these frameworks to study emotion classification across a variety of datasets and to
evaluate methods for improving cross-domain robustness based on multi-task learning and
synthetic speech.
First, we develop a framework for within-domain evaluation that employs multiple
datasets to evaluate models and can be applied to any dataset, classifier and feature
set. Our framework uses a full factorial design to examine overall trends and interactions
between datasets, classifiers and features. We employ it to evaluate several classifiers
and feature sets, particularly pre-trained embeddings, for emotion recognition. Our results
highlight diminishing returns when training deep neural networks from scratch instead
of using pre-trained embeddings as features; such embeddings encode more emotional
information than commonly used feature sets.
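To make the factorial design concrete, the following is a minimal sketch of the evaluation loop, not the thesis code: load_features() is a hypothetical loader stubbed with random data so the snippet runs end to end, and the dataset and feature-set names are illustrative.

```python
# Minimal sketch of a full factorial evaluation: every combination of
# dataset, feature set and classifier is tested, so main effects and
# interactions between the factors can be analysed afterwards.
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def load_features(dataset: str, feature_set: str):
    """Hypothetical loader returning (X, y) for a dataset/feature pair.

    Stubbed with random data so the sketch runs; a real pipeline would
    return e.g. acoustic functionals or utterance-averaged pre-trained
    embeddings.
    """
    rng = np.random.default_rng(hash((dataset, feature_set)) % 2**32)
    X = rng.normal(size=(200, 128))    # one row per utterance
    y = rng.integers(0, 4, size=200)   # four emotion classes
    return X, y

datasets = ["IEMOCAP", "EMO-DB", "CREMA-D"]   # illustrative names
feature_sets = ["acoustic", "embeddings"]
classifiers = {"SVM": SVC(), "LR": LogisticRegression(max_iter=1000)}

for dataset, feats, (name, clf) in product(
        datasets, feature_sets, classifiers.items()):
    X, y = load_features(dataset, feats)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{dataset:10s} {feats:10s} {name:3s} accuracy={acc:.3f}")
```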
Our cross-domain framework builds on our within-domain framework, allowing us to
explore how recognition rates change when training and test data are from different domains.
We combine datasets that differ in a single factor, such as language or country of
origin, but are similar in other properties, and perform cross-domain evaluation with these datasets.
We demonstrate our framework with cross-lingual, cross-country and cross-corpus recognition.
Our results show that the domains involved have less effect on accuracy than the
elicitation method, and that individual emotion accuracies vary more than overall accuracy,
thus highlighting the importance of testing on different combinations of datasets.
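As an illustration of the cross-domain protocol, the sketch below trains on one corpus and tests on every other, reporting both overall accuracy and per-emotion recall; load_features() is again a hypothetical stub and the corpus names are illustrative, not the datasets of the thesis.

```python
# Minimal sketch of cross-domain evaluation: train on one corpus, test
# on another, over every ordered pair of corpora.
from itertools import permutations

import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.svm import SVC

def load_features(corpus: str):
    """Hypothetical loader; stubbed with random data so the sketch runs."""
    rng = np.random.default_rng(abs(hash(corpus)) % 2**32)
    return rng.normal(size=(200, 128)), rng.integers(0, 4, size=200)

corpora = ["EMO-DB", "CaFE", "EmoFilm"]   # illustrative corpus names

for train_set, test_set in permutations(corpora, 2):
    X_tr, y_tr = load_features(train_set)
    X_te, y_te = load_features(test_set)
    y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
    overall = accuracy_score(y_te, y_pred)
    # Per-emotion recall typically varies far more across dataset
    # pairs than overall accuracy does.
    per_emotion = recall_score(y_te, y_pred, average=None)
    print(f"{train_set:8s} -> {test_set:8s} "
          f"overall={overall:.3f} per-emotion={np.round(per_emotion, 3)}")
```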
Finally, we use our cross-domain framework to evaluate techniques for improving model
robustness. Our multi-task learning techniques consider tasks either as subgroups of the
training data or as auxiliary variables to classify. Our synthetic speech framework adds
emotional expressions into neutral target speech using emotional voice conversion. Our
evaluations show that these techniques can increase cross-domain accuracy, but they are
more effective where the cross-domain gap is larger.
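The auxiliary-variable variant of multi-task learning can be sketched as a shared encoder feeding an emotion head and an auxiliary head. In the sketch below, the choice of source corpus as the auxiliary variable and the loss weight of 0.3 are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch of multi-task learning with an auxiliary variable:
# the joint loss pushes the shared encoder towards representations
# that support both tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    def __init__(self, n_features=128, n_emotions=4, n_corpora=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(64, n_emotions)  # main task
        self.corpus_head = nn.Linear(64, n_corpora)    # auxiliary task

    def forward(self, x):
        h = self.encoder(x)
        return self.emotion_head(h), self.corpus_head(h)

model = MultiTaskSER()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random stand-ins for utterance-level features and labels.
x = torch.randn(32, 128)
y_emotion = torch.randint(0, 4, (32,))
y_corpus = torch.randint(0, 3, (32,))

for step in range(100):
    emo_logits, cor_logits = model(x)
    # Weighted sum of the two task losses; 0.3 is an assumed
    # hyperparameter for illustration only.
    loss = (F.cross_entropy(emo_logits, y_emotion)
            + 0.3 * F.cross_entropy(cor_logits, y_corpus))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The synthetic speech idea can be sketched similarly, assuming a hypothetical convert_emotion() stand-in for a trained emotional voice conversion model (an identity stub here so the snippet runs): neutral utterances from the target domain are converted into each training emotion and added to the training pool.

```python
import numpy as np

def convert_emotion(waveform: np.ndarray, emotion: str) -> np.ndarray:
    # A real emotional voice conversion model would re-render the
    # utterance with the target emotion; the stub returns it unchanged.
    return waveform

emotions = ["anger", "happiness", "sadness"]
neutral_target = [np.zeros(16000) for _ in range(5)]  # dummy 1 s clips

augmented = [(convert_emotion(wav, emo), emo)
             for wav in neutral_target for emo in emotions]
print(len(augmented))  # 15 synthetic emotional training utterances
```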