Abstract:
Whole-genome sequencing has been completed for the entire population of the kākāpō, an
endangered New Zealand parrot. Despite the decreasing cost of DNA sequencing, an effort
on this scale is generally not feasible in conservation studies or large human cohorts. A
cost-saving strategy is to obtain relatively inexpensive information for the whole sample,
such as low-resolution genotype data, then resequence a small subsample of the original
sample at higher resolution and use the combined data to make inferences about the whole
sample. Such strategies are called two-phase sampling: the initial sampling of the cohort is
followed by subsampling of the individuals chosen for resequencing.
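
As a minimal sketch of the two-phase idea, the snippet below draws a phase-two subsample from a phase-one cohort and attaches inverse-probability weights. It assumes simple random subsampling with equal inclusion probabilities, which is only one possible design and is not necessarily the one used in this thesis; the function name `two_phase_sample` and its parameters are hypothetical.

```python
import numpy as np

def two_phase_sample(n_phase1, n_phase2, seed=0):
    """Pick a phase-two subsample by simple random sampling (an assumed design)."""
    rng = np.random.default_rng(seed)
    # Phase one: every individual has inexpensive, low-resolution data.
    ids = np.arange(n_phase1)
    # Phase two: a subsample is chosen for high-resolution resequencing.
    chosen = rng.choice(ids, size=n_phase2, replace=False)
    in_phase2 = np.zeros(n_phase1, dtype=bool)
    in_phase2[chosen] = True
    # Under simple random subsampling every individual has the same
    # inclusion probability, so selected individuals get weight 1 / probability.
    inclusion_prob = n_phase2 / n_phase1
    weights = np.where(in_phase2, 1.0 / inclusion_prob, 0.0)
    return in_phase2, weights

in_phase2, weights = two_phase_sample(n_phase1=200, n_phase2=40)
```

The weights of the selected individuals sum to the phase-one sample size, which is what allows the subsample to stand in for the whole cohort in a weighted analysis.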
This thesis explores two classes of approaches to handling incomplete data in two-phase
sampling designs under different settings. The first class is genotype
imputation, which predicts the missing genotypes using low-resolution
genotypes of the whole sample and high-resolution genotypes of the subsample. However,
genotype imputation is much more complicated for endangered species than for well-studied
species such as humans, livestock and other model organisms.
Alternatively, statistical inference about model parameters under two-phase sampling
designs can be carried out by maximum likelihood approaches that account for the
missingness mechanism of the data; this is the second class of approaches that I explore. In genetic
association studies, the polygenic model is often used to describe the architecture of complex
traits as it allows the possibility that thousands of variants could contribute to the phenotypic
variation in the population. Under this model, mixed models can be used to estimate
the genetic effect of a particular variant while attributing the remaining variation to the
population correlation structure.
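
For orientation, one common way of writing such a mixed model is sketched below. The notation is an assumption introduced here for illustration and is not taken from the thesis: y is the phenotype vector, X the covariate matrix, g_j the genotype vector of the variant being tested, and K a relatedness matrix.

```latex
\[
  y = X\beta + g_j \gamma_j + u + \varepsilon, \qquad
  u \sim N\!\left(0,\ \sigma_g^2 K\right), \qquad
  \varepsilon \sim N\!\left(0,\ \sigma_e^2 I\right),
\]
\[
  \operatorname{Var}(y) = \sigma_g^2 K + \sigma_e^2 I .
\]
```

In this form, \gamma_j is the effect of the tested variant, while the random effect u absorbs the remaining polygenic variation through the correlation structure K.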
In this thesis, I propose a weighted maximum likelihood approach for fitting mixed
models that takes advantage of the fact that the kākāpō population relatedness structure
is known, making it possible to incorporate the population covariance matrix rather than
the sample covariance matrix into the model. The performance of the proposed method is
evaluated using the kākāpō data and simulated data with a population structure similar to
that of humans. The method should therefore provide a general solution for fitting mixed
models under two-phase sampling designs in both endangered species and human populations.