Robotic Speech Recognition System
Abstract
Speech recognition systems provide the most natural means of communication with machines. Yet these systems are still not widely deployed on real robotic platforms, owing to factors such as the speed and accuracy of speech decoding. In this thesis, we address these factors from several perspectives and suggest a possible solution to each of them. The thesis has three main objectives.

The first objective is to develop a simple speech decoder based on recent advances in the field of speech recognition. This decoder is based on a static search space constructed using weighted finite-state transducers. The decoding mechanism is based on Viterbi beam pruning and is implemented using the token passing technique. To improve the speed of the developed decoder, we employ two approaches, namely likelihood caching and histogram pruning. Because the developed decoder provides the primitive functions necessary for real-time speech recognition, it can be viewed as a seed for future additions and improvements. In comparison with other speech decoders, such as Sphinx3 and HDecode, the developed decoder performs better when tested on the evaluation set of the Resource Management (RM1) speech corpus.

The second objective is to improve the speech decoding accuracy. In this regard, two approaches are proposed. The first approach optimizes the language model parameters on transducer-based decoding graphs using a sentence-level transition update. The decoding accuracy of the proposed method exceeds that of the common approach based on the word-pair transition update when tested on both the TIMIT and RM1 speech corpora. In the second approach, a new framework is proposed for jointly optimizing the parameters of the acoustic and language models on transducer-based decoding graphs. This framework exploits the inherent correlation between the acoustic and language models, and thus achieves better decoding accuracy than separate optimization of the two models.

The third objective is to improve the command decoding accuracy. This objective is realized by proposing a new approach for extracting tiny decoding graphs corresponding to the potential spoken commands used in human-robot interaction. The approach merges traditional grammar rules with n-gram models to produce an elegant and tiny decoding graph suitable for single-pass decoding. It significantly improves the command decoding accuracy compared with the traditional grammar-based and n-gram-based approaches in benchmark tests on the RM1 command-and-control corpus. Additionally, a significant improvement in decoding speed and a significant reduction in required memory are achieved, which underlines the effectiveness of the proposed approach for controlling service robots.

Apart from these objectives, we propose a systematic approach for developing robotic speech recognition systems based on the tripodal schematic architecture. This approach is presented using the developed speech recognition system, which is built on multiple threads and multiple buffers, as a case study. The advantage of the proposed approach is that it can guide software engineers in developing robotic speech recognition systems in an optimal and systematic way.