Abstract:
As speech-enabled technology becomes more prevalent, computers need to predict human
emotions accurately. Speech emotion recognition (SER) predicts emotions using just
speech, but current SER literature lacks standardisation in studying emotion recognition
across variations in speech, such as language. The resulting incompatible methodologies lead
to incomparable results with limited generality. To work towards standardised testing,
we propose and develop frameworks for within-domain and cross-domain evaluation. We
use these frameworks to study emotion classification across a variety of datasets and to
evaluate methods for improving cross-domain robustness based on multi-task learning and
synthetic speech.
First, we develop a framework for within-domain evaluation that employs multiple
datasets to evaluate models and can be applied to any dataset, classifier and feature
set. Our framework uses a full factorial design to examine overall trends and interactions
between datasets, classifiers and features. We employ it to evaluate several classifiers
and feature sets, particularly pre-trained embeddings, for emotion recognition. Our results
highlight diminishing returns when training deep neural networks from scratch instead
of using pre-trained embeddings as features; such embeddings encode more emotional
information than commonly used feature sets.
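To make the factorial design concrete, the following is a minimal sketch of the evaluation loop, not the thesis code: load_features() is a hypothetical loader stubbed with random data so the snippet runs end to end, and the dataset and feature-set names are illustrative.

```python
# Minimal sketch of a full factorial evaluation: every combination of
# dataset, feature set and classifier is tested, so main effects and
# interactions between the factors can be analysed afterwards.
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def load_features(dataset: str, feature_set: str):
    """Hypothetical loader returning (X, y) for a dataset/feature pair.

    Stubbed with random data so the sketch runs; a real pipeline would
    return e.g. acoustic functionals or utterance-averaged pre-trained
    embeddings.
    """
    rng = np.random.default_rng(hash((dataset, feature_set)) % 2**32)
    X = rng.normal(size=(200, 128))    # one row per utterance
    y = rng.integers(0, 4, size=200)   # four emotion classes
    return X, y

datasets = ["IEMOCAP", "EMO-DB", "CREMA-D"]   # illustrative names
feature_sets = ["acoustic", "embeddings"]
classifiers = {"SVM": SVC(), "LR": LogisticRegression(max_iter=1000)}

for dataset, feats, (name, clf) in product(
        datasets, feature_sets, classifiers.items()):
    X, y = load_features(dataset, feats)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{dataset:10s} {feats:10s} {name:3s} accuracy={acc:.3f}")
```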
Our cross-domain framework builds on our within-domain framework, allowing us to
explore how recognition rates change when training and test data are from different domains.
We combine datasets that differ in a single factor, such as language or country of
origin, but are similar in other properties, and perform cross-domain evaluation with these datasets.
We demonstrate our framework with cross-lingual, cross-country and cross-corpus recognition.
Our results show that the domains involved have less effect on accuracy than the
elicitation method, and that individual emotion accuracies vary more than overall accuracy,
thus highlighting the importance of testing on different combinations of datasets.
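As an illustration of the cross-domain protocol, the sketch below trains on one corpus and tests on every other, reporting both overall accuracy and per-emotion recall; load_features() is again a hypothetical stub and the corpus names are illustrative, not the datasets of the thesis.

```python
# Minimal sketch of cross-domain evaluation: train on one corpus, test
# on another, over every ordered pair of corpora.
from itertools import permutations

import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.svm import SVC

def load_features(corpus: str):
    """Hypothetical loader; stubbed with random data so the sketch runs."""
    rng = np.random.default_rng(abs(hash(corpus)) % 2**32)
    return rng.normal(size=(200, 128)), rng.integers(0, 4, size=200)

corpora = ["EMO-DB", "CaFE", "EmoFilm"]   # illustrative corpus names

for train_set, test_set in permutations(corpora, 2):
    X_tr, y_tr = load_features(train_set)
    X_te, y_te = load_features(test_set)
    y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
    overall = accuracy_score(y_te, y_pred)
    # Per-emotion recall typically varies far more across dataset
    # pairs than overall accuracy does.
    per_emotion = recall_score(y_te, y_pred, average=None)
    print(f"{train_set:8s} -> {test_set:8s} "
          f"overall={overall:.3f} per-emotion={np.round(per_emotion, 3)}")
```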
Finally, we use our cross-domain framework to evaluate techniques for improving model
robustness. Our multi-task learning techniques consider tasks either as subgroups of the
training data or as auxiliary variables to classify. Our synthetic speech framework adds
emotional expressions into neutral target speech using emotional voice conversion. Our
evaluations show that these techniques can increase cross-domain accuracy, but they are
more effective where the cross-domain gap is larger.
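The auxiliary-variable variant of multi-task learning can be sketched as a shared encoder feeding an emotion head and an auxiliary head. In the sketch below, the choice of source corpus as the auxiliary variable and the loss weight of 0.3 are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch of multi-task learning with an auxiliary variable:
# the joint loss pushes the shared encoder towards representations
# that support both tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    def __init__(self, n_features=128, n_emotions=4, n_corpora=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(64, n_emotions)  # main task
        self.corpus_head = nn.Linear(64, n_corpora)    # auxiliary task

    def forward(self, x):
        h = self.encoder(x)
        return self.emotion_head(h), self.corpus_head(h)

model = MultiTaskSER()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random stand-ins for utterance-level features and labels.
x = torch.randn(32, 128)
y_emotion = torch.randint(0, 4, (32,))
y_corpus = torch.randint(0, 3, (32,))

for step in range(100):
    emo_logits, cor_logits = model(x)
    # Weighted sum of the two task losses; 0.3 is an assumed
    # hyperparameter for illustration only.
    loss = (F.cross_entropy(emo_logits, y_emotion)
            + 0.3 * F.cross_entropy(cor_logits, y_corpus))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The synthetic speech idea can be sketched similarly, assuming a hypothetical convert_emotion() stand-in for a trained emotional voice conversion model (an identity stub here so the snippet runs): neutral utterances from the target domain are converted into each training emotion and added to the training pool.

```python
import numpy as np

def convert_emotion(waveform: np.ndarray, emotion: str) -> np.ndarray:
    # A real emotional voice conversion model would re-render the
    # utterance with the target emotion; the stub returns it unchanged.
    return waveform

emotions = ["anger", "happiness", "sadness"]
neutral_target = [np.zeros(16000) for _ in range(5)]  # dummy 1 s clips

augmented = [(convert_emotion(wav, emo), emo)
             for wav in neutral_target for emo in emotions]
print(len(augmented))  # 15 synthetic emotional training utterances
```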