How Well Does Your Phylogenetic Model Fit Your Data? Assessing Goodness of Fit in Phylogenetic Inference
Abstract
Testing how well a model fits the data is a fundamental principle of the statistical sciences. The purpose of such a test is to assess whether the selected best-fitting model adequately describes the behaviour of the data. Although goodness of fit tests are applied broadly across many areas of statistics, those for phylogenetic models have received much less attention than model selection methods in the last decade. A number of approaches have been suggested, but these are often flawed, with problems ranging from systematic error in the models themselves to the difficulties presented by the nature of phylogenetic data. Ultimately these problems lead to an inadequate choice of statistic, which is one of the main reasons why assessment of model-to-data fit is often neglected in phylogenetic analysis. In this work, we explore new and existing methods for goodness of fit assessment to address some of the many open questions around model adequacy in phylogenetic inference.
We assess different aspects of summary statistics and residual diagnostic tools, test the power and significance of established statistics, adapt statistics from other fields to the phylogenetic framework, and test the utility of combining multiple statistics to diagnose areas of poor model fit. Statistics were explored and tested within both the frequentist and Bayesian frameworks, with each approach addressing different facets of model fit. Overall, we provide new insights into some of the current challenges facing goodness of fit assessment in phylogenetics. We argue that the peculiarities of phylogenetic data challenge the development of robust tools, which requires a complex combination of statistical and biological formulation.