Abstract:
Genome-wide association studies (GWAS) are an essential tool in the biological and medical sciences for collecting the information needed to explore the genetic basis of polygenic diseases. A major difficulty is that very large sample sizes are required to have a reasonable chance of detecting an association, and the computational effort required to analyze the resulting data is considerable. The aim of this thesis is to describe a method, implemented in accompanying software, that assists the researcher once the initial SNP analysis is complete. The method uses the genotypes and the SNP-level association results to find gene sets/pathways containing genes enriched with SNPs that can recover the structure of the study. The software tool, GWPS (Genome Wide Pathway Search), allows researchers to explore the relationship between disease status (case or control) and clusters of SNPs, rather than single SNPs. The data are first converted to a gene score format using the genotype calls from the study and the results of the statistical association analysis. Using this gene score format, the SNPs are clustered into groups that reflect the way they naturally map to genes. The genes are then used as predictors in a logistic regression that attempts to recover the original structure of the experiment, and they are ranked according to how well they do so. Two further tests support the results of the regression analysis. A t-test on the sums of the gene scores contained in a pathway gives a quick assessment of the difference between the case group and the control group. A test for uniformity on the subset of SNPs mapping to genes in a pathway allows the user to judge whether the collection of SNPs in the pathway is random or whether it contains more statistically significant SNPs than would be expected by chance alone. All methods are implemented in a convenient R package, and a data set was analyzed to illustrate the effectiveness of GWPS.
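The two supporting tests can be illustrated with a minimal sketch. This is not the GWPS implementation: it is written in Python using only the standard library, whereas the package itself is in R. The function names (`welch_t`, `ks_uniform_statistic`) and all input values are hypothetical. Two assumptions go beyond the abstract: the t-test is taken to be Welch's unequal-variance form, and the uniformity test is taken to be a one-sample Kolmogorov-Smirnov comparison of the SNP p-values against Uniform(0, 1). The abstract specifies neither choice.

```python
import math
from statistics import mean, variance


def welch_t(case_sums, control_sums):
    """Welch's t statistic comparing per-sample pathway score sums.

    Assumption: each input value is the sum of one sample's gene scores
    over the genes in a single pathway.
    """
    n_case, n_control = len(case_sums), len(control_sums)
    # Standard error of the difference in means, with unequal variances.
    se = math.sqrt(variance(case_sums) / n_case
                   + variance(control_sums) / n_control)
    return (mean(case_sums) - mean(control_sums)) / se


def ks_uniform_statistic(p_values):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1).

    Assumption: under the null hypothesis the SNP p-values in a pathway
    are uniformly distributed, so a large statistic suggests an excess
    of small (significant) p-values.
    """
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap with the empirical CDF above, then below, the uniform CDF.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


# Made-up illustrative inputs, not data from the thesis.
case_sums = [5.0, 6.0, 7.0, 6.5]
control_sums = [3.0, 4.0, 3.5, 4.5]
pathway_p_values = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.5, 0.8]

t_stat = welch_t(case_sums, control_sums)
d_stat = ks_uniform_statistic(pathway_p_values)
print(f"t = {t_stat:.3f}, D = {d_stat:.3f}")
```

A large positive t statistic indicates higher pathway score sums in cases than in controls, and a large D indicates that the pathway's p-values depart from uniformity. Converting either statistic to a p-value requires the corresponding reference distribution, which this sketch omits.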