Abstract:
With the recent developments in speech synthesis via machine learning, this study
explores the incorporation of linguistic knowledge to visualise and evaluate synthetic
speech model training. Formant frequencies in speech provide information on the shape
of the human vocal tract that can differentiate one voice from another. An understanding
of formants provides a means to measure vowel sounds, which helps evaluate sound change
over time as well as accent, speaking and hearing differences in speech. These
vowels can be visualised by plotting the first and the second formant frequencies for
each vowel against one another to obtain a vowel space. Formants are heavily researched
in linguistics for human speech; however, they have not
been well explored in the realm of machine learning and text-to-speech synthesis. If
changes to the first and second formants (and, in turn, the vowel space) can be seen and
heard in synthetic speech, then this knowledge can inform speech synthesis technology
developers. To determine whether changes in the vowel space of synthetic speech can be
seen and heard, a speech synthesis model trained on a large General American English
database was fine-tuned into a New Zealand English voice. The vowel spaces at different
intervals during the fine-tuning were analysed to determine if the model learned the
New Zealand English vowel space. Our findings based on vowel space analysis show that
we can visualise how a speech synthesis model learns the vowel space of the database
it is trained on. The results show that a model can successfully adapt the shape
of a vowel space between accents and even between speakers. Perception tests confirmed that
humans can perceive when a speech synthesis model has learned characteristics of the
speech database it is trained on. The methods used in this study have also been made
into an open-source package for use in future research. The research conducted
in this study can also help speech technology developers: vowel space analysis offers
a supplementary means of evaluation alongside perceptual analysis, providing intermediary
feedback and limiting the need for repeated perception tests.
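
As a rough illustration of the vowel space idea described above, the following minimal sketch averages the first and second formants per vowel and computes the area enclosed by three corner vowels. It is not the study's open-source package: the function names `vowel_space` and `polygon_area`, the choice of corner vowels, and the formant values are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def vowel_space(measurements):
    """Return {vowel: (mean_F1, mean_F2)} from (vowel, F1_Hz, F2_Hz) tuples."""
    grouped = defaultdict(list)
    for vowel, f1, f2 in measurements:
        grouped[vowel].append((f1, f2))
    return {
        vowel: (mean(f1 for f1, _ in pts), mean(f2 for _, f2 in pts))
        for vowel, pts in grouped.items()
    }


def polygon_area(points):
    """Shoelace area of a polygon whose vertices are listed in order."""
    n = len(points)
    twice_area = sum(
        points[i][0] * points[(i + 1) % n][1]
        - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


# Made-up formant values in Hz, for illustration only (not data from the study).
tokens = [
    ("i", 300, 2300), ("i", 320, 2200),
    ("a", 750, 1300), ("a", 770, 1250),
    ("u", 350, 900), ("u", 330, 850),
]

space = vowel_space(tokens)
# Area of the triangle spanned by the mean positions of the corner vowels.
area = polygon_area([space[v] for v in ("i", "a", "u")])
```

Computing `space` and `area` at successive fine-tuning checkpoints would give one simple numeric way to track how a vowel space shifts; plotting the per-vowel means with F1 against F2 gives the corresponding visualisation.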