Abstract:
Healthcare robots that interact with humans are increasingly common. Many of these social robots can interact with humans using their voice, and the synthesised voice of a social robot can affect its acceptance by humans. The first research question of this thesis was to identify the type of voice needed for a healthcare robot interacting with humans; once that type of voice was identified, the second research question aimed to synthesise it.
To identify the type of voice needed for a healthcare robot, a perception test was conducted. The focus of this thesis is on empathetically speaking social robots, where empathy was conveyed only through prosody modelling of the speech signal. In the perception test, participants were asked questions about their preference for an empathetically speaking healthcare robot. From the participants' responses and a detailed analysis of a particular healthcare robot's dialogues, the emotions needed for an empathetic voice were identified. To address the second research question, prosody features of speech were modelled for the emotions identified in research question 1. Further perception tests were then conducted to evaluate the synthesised emotional speech.
The results obtained suggest that people prefer empathetically speaking healthcare robots. A major finding of this thesis is that an empathetic voice requires not only the primary emotions but also some secondary emotions: anxious, apologetic, confident, enthusiastic, and worried. An emotional speech corpus containing these secondary emotions was developed and acoustically analysed. Based on this acoustic analysis, the fundamental frequency contour was modelled parametrically, while the speech rate and mean intensity were modelled using rules. Ensemble regression was used to predict these three prosody features for each of the secondary emotions. Using these prediction models and a Hidden Markov Model-based speech synthesis approach, the secondary emotions were synthesised. Results of the perception test showed that participants could perceive the secondary emotions. Finally, tying the whole study together, participants could perceive high levels of empathy from a healthcare robot speaking with a synthesised voice containing the five secondary emotions.
In short, the emotions needed for an empathetically speaking healthcare robot were
identified. These were secondary emotions, which were synthesised by modelling prosody features using machine learning. Finally, participants agreed that
they could perceive empathy from the synthesised voice of a healthcare robot.