Abstract:
Human-robot interaction has progressed rapidly, and the evolutionary functions of emotions have made research on observing the characteristics of human emotions increasingly popular. Much of this past research has investigated the role of facial expressions; research on the voice has lagged behind in explaining how different emotions affect the vocal mechanism, the acoustic cues produced, and the nature of the processes that allow listeners to recognise affective states.
In this thesis, emotions in short vowel segments of continuous speech were analysed, focusing on formant frequencies and glottal features. The valence and arousal of the emotions were used to distinguish them from the neutral speaking style. Two emotional speech databases were investigated: the German emotional speech corpus Emo-DB and the New Zealand English emotional speech corpus JLCorpus. In addition to the big five emotions, this thesis introduces a formant and glottal
analysis of five secondary emotions through the JLCorpus. The first (F1) and second (F2) formant
frequencies were calculated for the vowels and presented in a centroid vowel space plot for
analysis. Results indicated that high-arousal emotions produced a higher mean F1, while low-arousal emotions produced a lower mean F1, and positive-valence emotions correlated with a higher mean F2. Further statistical analysis revealed that emotion has a significant effect on most vowels.
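As a minimal sketch of this kind of formant measurement (not the exact pipeline used in the thesis), F1 and F2 for a labelled vowel segment can be estimated with Praat's Burg formant tracker via the parselmouth library; the file name and segment boundaries below are hypothetical placeholders.

```python
# Sketch: mean F1/F2 of one vowel segment using Praat (Burg method) via parselmouth.
# The WAV file name and the vowel start/end times are hypothetical placeholders.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("speaker01_angry_sent01.wav")      # hypothetical recording
vowel = snd.extract_part(from_time=0.42, to_time=0.55)     # hypothetical vowel span (s)

# Typical Praat-style settings: track 5 formants up to 5500 Hz.
formants = vowel.to_formant_burg(time_step=0.005,
                                 max_number_of_formants=5,
                                 maximum_formant=5500.0)

f1 = call(formants, "Get mean", 1, 0.0, 0.0, "hertz")      # mean F1 over the whole segment
f2 = call(formants, "Get mean", 2, 0.0, 0.0, "hertz")      # mean F2 over the whole segment
print(f"mean F1 = {f1:.0f} Hz, mean F2 = {f2:.0f} Hz")
```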
The glottal analysis investigated the voice quality of the various emotions in order to distinguish them. The four long vowels from the JLCorpus sentences were inverse-filtered using Iterative Adaptive Inverse Filtering (IAIF), and the resulting glottal flow estimates were parameterised. The glottal features of open quotient (OQ), normalised amplitude quotient (NAQ) and H1-H2 were extracted and analysed. NAQ and H1-H2 tended to differentiate between the arousal and valence levels perceived for all speakers. High-arousal emotions had higher NAQ and H1-H2 values than neutral, indicating a breathier phonation when expressing the emotion, whereas low-arousal emotions had lower NAQ and H1-H2 values than neutral, indicating a creakier phonation. In this study, open quotient was not found to be emotion-discriminative.
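As an illustrative sketch (not the thesis's exact implementation), the two discriminative glottal features can be computed from an IAIF glottal-flow estimate along the following lines; `glottal_flow_period`, `glottal_flow`, `fs` and `f0` are assumed inputs obtained from external IAIF and pitch-tracking steps.

```python
# Sketch: NAQ and H1-H2 from a glottal-flow estimate (assumed to come from IAIF).
import numpy as np

def naq(glottal_flow_period, fs):
    """Normalised amplitude quotient for a single glottal cycle (Alku-style definition)."""
    t0 = len(glottal_flow_period) / fs                             # period length (s)
    a_ac = glottal_flow_period.max() - glottal_flow_period.min()   # peak-to-peak flow amplitude
    d_min = np.min(np.diff(glottal_flow_period) * fs)              # negative peak of flow derivative
    return a_ac / (abs(d_min) * t0)

def h1_h2(glottal_flow, fs, f0):
    """H1-H2 in dB: level difference of the first two harmonics of the flow spectrum."""
    spec = np.abs(np.fft.rfft(glottal_flow * np.hanning(len(glottal_flow))))
    freqs = np.fft.rfftfreq(len(glottal_flow), 1.0 / fs)
    h1 = spec[np.argmin(np.abs(freqs - f0))]                       # harmonic amplitude near f0
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]                   # harmonic amplitude near 2*f0
    return 20 * np.log10(h1 / h2)
```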
The combination of glottal and acoustic features produced better discrimination overall. Analysis of the discriminating feature sets used in the study clearly indicates that formant frequencies and glottal features are vital components of emotional speech analysis. The thesis concludes by discussing the implications of these results.