Abstract:
Understanding stutter behaviour is key to obtaining an accurate likelihood statistic from DNA analysis. Many methods have been explored for modelling stutter; however, none of them incorporates up-to-date statistical methods for left-censored data. The methods discussed in this research are a substitution method, censored regression with maximum likelihood estimation, and censored regression with Bayesian estimation. Stutter ratio and stutter height were used as the variables of interest, and the behaviour of both was analysed. D22 was selected as the locus of interest because its stutter behaviour differs from that of all other loci; most importantly, it tends to stutter more. Stutter behaviour for back stutter, double back stutter, and forward stutter was assessed and models were created. All censored stutter data was skewed with long tails, so a normal distribution was quickly ruled out. The substitution technique modelled back stutter successfully, as the substituted values were close to the values that would be expected. For forward and double back stutter the substituted values were not accurate, and hence the substitution technique was suboptimal. Overall, censored regression with maximum likelihood estimation performed best at parameter estimation and at modelling stutter behaviour. The Bayesian technique produced estimates that were close to the maximum likelihood estimates. For back stutter height, a Gamma distribution censored regression model that included both allele and allele height could model stutter behaviour. For forward stutter height, a Log Normal distribution censored regression model that included both allele and allele height could model stutter behaviour. Results for double back stutter should not be applied, as the percentage of censoring (approximately 90%) was too high to obtain meaningful results.
Overall, back stutter, forward stutter, and double back stutter behaved similarly. Researchers should find that a simple Log Normal censored regression model with maximum likelihood estimates will appropriately model all three.
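
As an illustration of what a left-censored Log Normal regression likelihood looks like, the following is a minimal sketch and not the model fitted in this research. It assumes a single covariate (the log of the parent allele height) and a fixed detection threshold. The function names, the coefficient names b0 and b1, the threshold of 50, and the example values are all hypothetical. Observed stutter peaks contribute the Log Normal density; peaks below the threshold contribute the probability of falling below it.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_lognormal_loglik(params, stutter_heights, parent_heights, threshold):
    """Log-likelihood of a left-censored Log Normal regression (illustrative).

    Assumed model: log(stutter height) ~ Normal(b0 + b1 * log(parent height), sigma).
    A stutter height of None marks a peak below the detection threshold.
    """
    b0, b1, log_sigma = params
    sigma = math.exp(log_sigma)  # optimise on the log scale so sigma stays positive
    total = 0.0
    for y, x in zip(stutter_heights, parent_heights):
        mu = b0 + b1 * math.log(x)
        if y is None:
            # Censored observation: contributes P(Y < threshold).
            z = (math.log(threshold) - mu) / sigma
            total += math.log(max(norm_cdf(z), 1e-300))
        else:
            # Observed value: contributes the Log Normal log-density.
            z = (math.log(y) - mu) / sigma
            total += (-0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
                      - math.log(sigma) - math.log(y))
    return total


# Hypothetical data: None marks a stutter peak below the threshold.
stutter = [60.0, None, 120.0, None, 85.0]
parent = [800.0, 300.0, 1500.0, 400.0, 1000.0]

# Compare a stutter ratio near 0.08 with an implausible ratio of 0.5.
ll_plausible = censored_lognormal_loglik((math.log(0.08), 1.0, math.log(0.3)),
                                         stutter, parent, threshold=50.0)
ll_implausible = censored_lognormal_loglik((math.log(0.5), 1.0, math.log(0.3)),
                                           stutter, parent, threshold=50.0)
```

Maximum likelihood estimates would be obtained by maximising this function over the parameters with a numerical optimiser. A substitution method would instead replace each censored value with a fixed number, such as half the threshold, before fitting an ordinary model.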