Two-phase subsampling for DNA sequencing with application to endangered species

Luo, Pei (Zoe)

Identifier: https://hdl.handle.net/2292/67513

Issue Date: 2024

Degree Name: PhD

Degree Grantor: The University of Auckland

Rights: Copyright: The author

Rights (URI): https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm

Abstract:

Whole-genome sequencing of the kākāpō, an endangered New Zealand parrot species, has been completed for the entire population. Despite the decreasing cost of DNA sequencing, this sort of effort is generally not feasible in conservation studies or large human cohorts. A cost-saving strategy is to obtain relatively inexpensive information, such as low-resolution genotype data, for the whole sample, then resequence a small subsample at higher resolution and use the combined data to make inferences about the whole sample. Such strategies are called two-phase sampling: the initial sampling of the cohort is followed by subsampling of the individuals chosen to be resequenced. This thesis explores two classes of approaches to handling incomplete data in two-phase sampling designs under different situations. The first class is genotype imputation, the process of predicting missing genotypes using low-resolution genotypes of the whole sample and high-resolution genotypes of the subsample. However, genotype imputation is much more complicated for endangered species than for well-studied species such as humans, livestock and other model organisms. The second class, which I also explore, is statistical inference of model parameters under two-phase sampling designs by maximum likelihood approaches that account for the missing-data mechanism. In genetic association studies, the polygenic model is often used to describe the architecture of complex traits, as it allows for the possibility that thousands of variants contribute to phenotypic variation in the population. Under this model, mixed models can be used to measure the genetic effect of a particular variant while attributing the remaining variation to the population correlation structure.
In this thesis, I propose a weighted maximum likelihood approach for fitting mixed models that takes advantage of the fact that the kākāpō population relatedness structure is known, making it possible to incorporate the population covariance matrix rather than the sample covariance matrix into the model. The performance of the proposed method is evaluated using the kākāpō data and simulated data with a population structure similar to that of humans. The method should therefore provide a general solution for fitting mixed models under two-phase sampling designs in both endangered species and human populations.
