Abstract:
This thesis addresses the problem of determining both the transfer function of the vocal tract and the air volume-velocity in the glottis by analysing voiced speech sounds. The study examines a number of methods based on a parametric model of the speech waveform. The distinctive feature of this model is that it incorporates an explicit representation of the glottal volume-velocity waveform. The purpose of the study is twofold. The first aim is to develop an algorithm that produces accurate estimates when applied to synthetic speech generated in accordance with the assumed speech model. The second aim is to evaluate the accuracy of this model, or more specifically the accuracy of the associated model of glottal flow, by analysing real speech. The investigation begins with a detailed evaluation of a method proposed by Milenkovic. This method requires initial estimates of both the glottal endpoints and the vocal tract transfer function, and updates these recursively. The same procedure also produces estimates of the remaining parameters of the glottal flow model. In contrast to previous reports, the tests carried out here show conclusively that the method performs very poorly. In light of these results, three new methods have been developed in this study, all based initially on Milenkovic's model of the glottal flow. One of these is similar to Milenkovic's, but differs in the way it updates the initial estimates of the glottal endpoints and the transfer function parameters. In many cases this method produces much better results than Milenkovic's, but it performs poorly for sounds of moderate to high pitch. It is argued that this occurs because the technique, like Milenkovic's method, depends on an initial transfer function estimate obtained by Linear Predictive Coding, which is known to perform poorly at high pitch.
To avoid this problem, two further methods are presented which do not require an initial transfer function estimate. Both methods employ a grid search together with a simplex search to estimate the glottal endpoints, while the other parameters are estimated by the linear least squares method. The two methods differ only in the organisation of the search strategy. Both give extremely good results when applied to synthetic speech, although one involves less computational expense and is also judged likely to be more robust under real analysis conditions. In the final part of the study, this method is applied to several real sounds in order to evaluate the accuracy of the speech model. Although the results of these analyses are more difficult to evaluate objectively, they appear to be quite realistic. There is, however, some uncertainty as to the accuracy of the glottal flow estimates, because of the extent to which they are constrained by the form of Milenkovic's model. To investigate this, a new glottal model has been developed and incorporated into the algorithm. Estimates obtained using this new model in some cases show a much more gradual increase in volume-velocity as the glottis opens than is possible using Milenkovic's model. This appears to be more consistent with other published data. In addition, the analyses based on this model give slightly increased values of prediction gain, and from this it is concluded that the model and the estimates which it produces are superior.
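The search strategy summarised above (a grid search followed by a simplex search over the glottal endpoints, with the remaining parameters obtained by linear least squares) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the raised-sine `pulse` is a toy stand-in and not Milenkovic's glottal model, the only linear parameters are an amplitude and an offset rather than vocal tract coefficients, and the function names and the simplified simplex variant are hypothetical and not taken from the thesis.

```python
import math

def pulse(n, t_open, t_close):
    """Toy pulse (assumption): a raised-sine bump between the two endpoints."""
    if not (t_open <= n <= t_close):
        return 0.0
    return math.sin(math.pi * (n - t_open) / (t_close - t_open)) ** 2

def fit_linear(y, t_open, t_close):
    """With the endpoints fixed, solve for amplitude a and offset b in closed form."""
    if t_close - t_open < 2.0:
        return 0.0, 0.0, math.inf  # reject degenerate endpoint pairs
    g = [pulse(n, t_open, t_close) for n in range(len(y))]
    N, sg, sy = len(y), sum(g), sum(y)
    sgg = sum(v * v for v in g)
    sgy = sum(v * w for v, w in zip(g, y))
    det = N * sgg - sg * sg  # determinant of the 2x2 normal equations
    a = (N * sgy - sg * sy) / det if abs(det) > 1e-12 else 0.0
    b = (sy - a * sg) / N
    cost = sum((w - a * v - b) ** 2 for v, w in zip(g, y))
    return a, b, cost

def grid_search(y, step=2):
    """Coarse search over integer endpoint pairs; returns the best pair."""
    best = (math.inf, 0.0, 0.0)
    for t_open in range(0, len(y), step):
        for t_close in range(t_open + step, len(y), step):
            cost = fit_linear(y, t_open, t_close)[2]
            if cost < best[0]:
                best = (cost, float(t_open), float(t_close))
    return best[1], best[2]

def simplex_search(f, x0, step=1.0, iters=400):
    """Simplified Nelder-Mead style simplex search (reflect, expand, contract, shrink)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        simplex.append([x0[j] + (step if j == i else 0.0) for j in range(n)])
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflect = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflect) < f(best):
            expand = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expand if f(expand) < f(reflect) else reflect
        elif f(reflect) < f(simplex[-2]):
            simplex[-1] = reflect
        else:
            contract = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contract) < f(worst):
                simplex[-1] = contract
            else:  # shrink every vertex halfway towards the best one
                simplex = [best] + [
                    [0.5 * (bj + pj) for bj, pj in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)

def estimate(y):
    """Grid search for a starting point, simplex refinement, then a final linear fit."""
    x0 = grid_search(y)
    t_open, t_close = simplex_search(lambda p: fit_linear(y, p[0], p[1])[2], x0)
    a, b, _ = fit_linear(y, t_open, t_close)
    return t_open, t_close, a, b

# Synthetic test signal generated from the same toy model (values are arbitrary).
y = [2.0 * pulse(n, 12.3, 41.7) + 0.5 for n in range(64)]
print(estimate(y))
```

The point of the split is that the nonlinear search only ever runs over the two endpoints: for each candidate pair the linear parameters are recomputed exactly, so no initial estimate of them is needed.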