Abstract:
The SCOPE (Screening for Pregnancy Endpoints) Project is a large-scale international project that aims to screen first-time mothers for pregnancy-related diseases. Several statistical methodologies have been developed to address issues associated with the large datasets in the SCOPE project. Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) is used to identify differentially expressed proteins, which may be applied to biomarker discovery. A limitation of 2D PAGE is its inability to detect a protein whose expression level falls below the limit of detection. A likelihood model was proposed to address this issue by incorporating a mixture model that takes into account both detected and non-detected proteins. Simulation analyses showed that the likelihood model has higher statistical power than the standard statistical analyses. A global Bayesian model, extended from the likelihood model, was proposed that can identify groups of differentially expressed proteins simultaneously. Several global distributions are used to model the underlying relationship between individual spot parameters, and these parameters are estimated by a Markov chain Monte Carlo (MCMC) technique. The simulation analyses showed that the global model accurately recovers the underlying global distributions and identifies more differentially expressed proteins than the likelihood model.
Several prediction models for uncomplicated pregnancy and preeclampsia were constructed using the clinical features in the SCOPE database. These models were constructed using either logistic regression or linear discriminant analysis with various variable selection procedures. Most of the clinical risk factors that the models selected had previously been reported in other studies. The performance of the models was measured by the area under the Receiver Operating Characteristic (ROC) curve, and there were no significant differences between these models.
Two versions of a hierarchical prediction framework were proposed in an attempt to improve prediction accuracy. A different disease endpoint was classified at each stage, and a different set of variables was used for each stage of the hierarchy. The structure of the hierarchical framework was predetermined based on the clinical relationships between these disease endpoints. Despite partitioning the database in this way, the hierarchical frameworks showed no significant improvement in prediction performance over the non-hierarchical model.