Abstract:
A main focus of state-of-the-art automatic speaker identification systems is finding efficient features that carry speakers' attributes in the speech signal. However, a major deficiency of the commonly used features is their lack of robustness in mismatched environments: they are sensitive to corrupted acoustic conditions and easily distorted by additive noise. The objective of this dissertation has been to develop effective and robust feature extraction algorithms for speaker identification systems in noisy, mismatched environments. The first proposed algorithm processes a single-channel speech signal and models the human auditory system. The processing comprises Gammatone auditory bandpass filtering of the speech signal, half-wave rectification, and A-law compression, modelling the effects of the auditory periphery. Three features are extracted by applying Independent Component Analysis (ICA) to the frequency, cepstral, and autocorrelogram domains of the compressed output signals, respectively. Speaker identification experiments investigate these features using a speech corpus containing both live and loudspeaker-played recordings. The experiments show that these features characterise the distribution of speakers well and are robust to additive noise; among them, the feature extracted in the autocorrelogram domain achieves the best identification performance in noisy, mismatched environments.

Inspired by these methods, a feature extraction algorithm based on a human binaural model is then developed. A pair of microphones replicates the human ears: cross-correlation is applied to the microphone outputs after Gammatone bandpass filtering, rectification, and compression, and ICA is then applied to the real cepstrum of the correlated waveform to extract the dominant components of each frequency band. The combination of cross-correlation and ICA thereby minimises the effect of speech-uncorrelated, noisy-mismatched acoustic conditions.
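The monaural front-end described above (Gammatone filterbank, half-wave rectification, A-law compression) can be sketched as follows. This is a minimal illustration, not the dissertation's exact implementation: the function names, filter order, impulse-response length, ERB bandwidth formula, and normalisation steps are all assumptions chosen for clarity.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.025, order=4, b_factor=1.019):
    """Impulse response of a 4th-order Gammatone filter centred at fc (Hz).

    Uses the Glasberg-Moore ERB formula; parameters are illustrative.
    """
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)          # equivalent rectangular bandwidth
    b = b_factor * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def a_law(x, A=87.6):
    """A-law compression of a signal normalised to [-1, 1]."""
    ax = np.abs(x)
    y = np.where(ax < 1.0 / A,
                 A * ax / (1 + np.log(A)),
                 (1 + np.log(np.clip(ax, 1.0 / A, None) * A)) / (1 + np.log(A)))
    return np.sign(x) * y

def auditory_frontend(x, fs, centre_freqs):
    """Gammatone filterbank -> half-wave rectification -> A-law compression."""
    x = x / (np.max(np.abs(x)) + 1e-12)              # normalise before compression
    bands = []
    for fc in centre_freqs:
        band = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        band = np.maximum(band, 0.0)                 # half-wave rectification
        band = band / (np.max(band) + 1e-12)         # per-band normalisation (assumption)
        bands.append(a_law(band))
    return np.stack(bands)                           # shape: (n_bands, n_samples)
```

The resulting band envelopes would then feed the ICA-based feature extraction in the frequency, cepstral, or autocorrelogram domain.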
The extracted features are evaluated with respect to the individual processing blocks of the proposed algorithm, giving a better understanding of how specific design choices and parameter values contribute to its performance. The proposed features are tested on a speech corpus we developed from the TIMIT database. The corpus includes 188 speakers, each providing 12 seconds of training and 5 seconds of testing utterances. Interference and background noises at various signal-to-noise ratio (SNR) levels are generated with loudspeakers while the speech signals are recorded with a two-microphone array. All utterances are recorded in two rooms, a non-reverberant whisper room and a normal office room, to investigate how room reverberation affects the proposed features. The recordings are also repeated using an acoustic artificial head: an ideal recording device that mimics the characteristics of the human ear, replicates human hearing behaviour with respect to sound diffraction and reflection, and emulates the direction-dependent transfer function of the human ear. Based on these databases, an implementation of a text-independent speaker identification system is presented. Experimental results indicate that the proposed algorithm achieves significant improvements in identification performance over commonly used features in several experimental setups.
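The SNR-controlled noise conditions can be illustrated with a simple digital mixing sketch. This is an assumption for illustration only: in the corpus described above the noise was played through loudspeakers during recording, not mixed digitally, and the function name and scaling scheme are ours.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)           # loop/trim noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Solve p_speech / (scale**2 * p_noise) = 10**(snr_db / 10) for scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` over a range of levels reproduces, in simulation, the kind of noisy-mismatched test conditions the recorded corpus provides physically.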