Visualising Model Training via Vowel Space for Text-To-Speech Systems

Abeysinghe, Binu

Identifier: https://hdl.handle.net/2292/61844

Issue Date: 2022

Degree Grantor: The University of Auckland

Rights: Copyright: the author

Rights (URI): https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm

Abstract:

Following recent developments in speech synthesis via machine learning, this study explores how linguistic knowledge can be used to visualise and evaluate the training of synthetic speech models. Formant frequencies in speech provide information on the shape of the human vocal tract that can differentiate one voice from another. Because formants provide a means to measure vowel sounds, an understanding of them helps evaluate sound change over time and variation in speech, such as differences in accent, speaking and hearing. Vowels can be visualised by plotting the first formant frequency of each vowel against the second, which yields a vowel space. Formants are heavily researched in linguistics for human speech; however, they have not been well explored in machine learning and text-to-speech synthesis. If changes to the first and second formants (and, in turn, the vowel space) can be seen and heard in synthetic speech, then this knowledge can inform developers of speech synthesis technology. To identify whether such changes can be seen and heard, a speech synthesis model trained on a large General American English database was fine-tuned into a New Zealand English voice. The vowel spaces at different intervals during fine-tuning were analysed to determine whether the model learned the New Zealand English vowel space. Our findings, based on vowel space analysis, show that we can visualise how a speech synthesis model learns the vowel space of the database it is trained on. The results show that a model can successfully adapt the shape of a vowel space between accents and even between speakers. Perception tests confirmed that humans can perceive when a speech synthesis model has learned characteristics of the speech database it is trained on. The methods used in this study have also been made into an open-source package for use in future research. In addition, this research can help speech technology developers by supplementing perceptual analysis with a further means of evaluation, one that provides intermediate feedback during training and limits the need for repeated perception tests.
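As an illustration of the vowel space idea described in the abstract, the following is a minimal Python sketch, not the thesis's open-source package. It assumes formants have already been measured elsewhere and are supplied as (vowel label, F1 in Hz, F2 in Hz) tuples. The function names `vowel_means` and `vowel_shift` are hypothetical, and the use of Euclidean distance in the F1-F2 plane to compare two training checkpoints is an assumption made for this example.

```python
from collections import defaultdict
from math import hypot
from statistics import mean


def vowel_means(tokens):
    """Average F1 and F2 per vowel label.

    tokens: iterable of (vowel_label, f1_hz, f2_hz) tuples.
    Returns a dict mapping each label to (mean F1, mean F2),
    i.e. one point per vowel in the vowel space.
    """
    grouped = defaultdict(list)
    for label, f1, f2 in tokens:
        grouped[label].append((f1, f2))
    return {
        label: (mean(f1 for f1, _ in pairs), mean(f2 for _, f2 in pairs))
        for label, pairs in grouped.items()
    }


def vowel_shift(space_a, space_b):
    """Distance each vowel moved between two vowel spaces.

    Assumption for this sketch: movement is measured as Euclidean
    distance in Hz in the F1-F2 plane. Only vowels present in both
    spaces are compared.
    """
    shared = space_a.keys() & space_b.keys()
    return {
        v: hypot(space_b[v][0] - space_a[v][0], space_b[v][1] - space_a[v][1])
        for v in shared
    }
```

Applying `vowel_means` to the formant measurements taken at each saved checkpoint gives one vowel space per checkpoint, and `vowel_shift` between consecutive checkpoints indicates which vowels are still moving. For plotting, vowel charts conventionally place F2 on the horizontal axis and F1 on the vertical axis, both reversed.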
