Principal Components Analysis in the Age of Large Genetic Datasets and Rare Variants

Barnett, Daniel

Identifier: http://hdl.handle.net/2292/36304

Issue Date: 2017

Degree Grantor: The University of Auckland

Rights: Copyright: The author

Rights (URI): https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm

Abstract:

Advances in sequencing technologies have significantly lowered the barriers to collecting large amounts of human DNA data for association studies, including rarer variation not usually genotyped with prior methods. Identifying population structure is an essential part of these studies to minimise confounding, for which principal components analysis (PCA) is a common technique. Rare variants have been thought to characterise finer-scale population structure, but classical PCA methods are unsuitable for the large matrix sizes collected and the optimal weighting of rare variants is not clear. In this thesis, we investigate these issues. We implement two efficient stochastic algorithms and one streaming algorithm for performing PCA in R, and compare these to classical algorithms in terms of accuracy and speed. Accuracy comparisons assess the resulting PCAs’ similarity to results from the classical methods in terms of both overall subspace and individual dimensions, whereas the present literature examines only one or the other. To investigate the effect of rare variants on population structure estimates from PCA, we use three weightings for individual variants (unit-variance, uniform, and Beta distribution) on a simulated DNA sequence and data from the 1000 Genomes Project. We compare the ability of each weighting to recover the true population structure using either the entire sequence or subsets of rare or common variants. Each set of principal components is used as covariates in association tests of a phenotype based on population membership and a set of variants, and the effectiveness of these at reducing confounding is compared to using true population membership as a covariate. We find the stochastic and streaming algorithms provide significant reductions in the computational burden of PCA compared with classical algorithms while obtaining essentially identical results given appropriate parameters.
In simulations, recent demographic events cannot be identified using common variants alone. However, we find rare variants distort principal components extracted from the 1000 Genomes dataset while enabling the identification of an outlier. Including rare variants and using unit-variance or Beta distribution weights provides covariates closest to using true population membership in association tests.
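
As a rough illustration of the three variant weightings named in the abstract, the sketch below centres a small 0/1/2 genotype matrix and scales each variant by a weight that depends on its minor allele frequency (MAF). It is not the thesis's implementation, which is written in R. The function names are hypothetical, and the Beta shape parameters (a = 1, b = 25) are an assumption taken from common practice in rare-variant association testing; the abstract does not state which parameters were used. The unit-variance weight assumes Hardy-Weinberg equilibrium, under which a genotype with allele frequency p has variance 2p(1 - p).

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def variant_weight(maf, scheme, a=1.0, b=25.0):
    """Weight for one variant given its minor allele frequency (0 < maf <= 0.5).

    The defaults a=1, b=25 are an assumption, not values from the thesis.
    """
    if scheme == "uniform":
        return 1.0
    if scheme == "unit-variance":
        # Under Hardy-Weinberg equilibrium the genotype variance is 2p(1-p).
        return 1.0 / math.sqrt(2.0 * maf * (1.0 - maf))
    if scheme == "beta":
        return beta_pdf(maf, a, b)
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_genotypes(genotypes, scheme):
    """Centre each variant (column) of a 0/1/2 genotype matrix and weight it."""
    n = len(genotypes)
    m = len(genotypes[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        column = [row[j] for row in genotypes]
        freq = sum(column) / (2.0 * n)  # allele frequency of the counted allele
        maf = min(freq, 1.0 - freq)
        if maf == 0.0:
            continue  # monomorphic variant carries no information; leave zeros
        mean = 2.0 * freq
        weight = variant_weight(maf, scheme)
        for i in range(n):
            out[i][j] = weight * (genotypes[i][j] - mean)
    return out


if __name__ == "__main__":
    # Four individuals, two variants; the second variant is monomorphic.
    toy = [[0, 2], [1, 2], [2, 2], [1, 2]]
    for scheme in ("uniform", "unit-variance", "beta"):
        print(scheme, weighted_genotypes(toy, scheme))
```

The weighted matrix would then be passed to whichever PCA routine is in use. With these assumed Beta parameters, a variant with MAF 0.01 receives a much larger weight than one with MAF 0.3, which is the sense in which such a weighting emphasises rare variants.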

Description:

Full text is available to authenticated members of The University of Auckland only.

Files in this item

Name: whole.pdf

Size: 15.21Mb

Format: PDF

This item appears in the following Collection(s)

Masters Theses - Authenticated Access [6749]