Semiparametric maximum likelihood for multi-phase response-selective sampling and missing data problems
Abstract
Missing data is a common occurrence in most medical (or other) research data collection processes. Missing data patterns can sometimes arise by design. In multi-phase response-selective sampling, the response variable Y and some easily obtainable variables are fully observed for a first-phase sample or finite population. Some of the covariates of interest, X, which might be difficult or too expensive to obtain, are then observed at later phases for smaller subsamples. The selection of data at each phase depends on the categorical information of Y, with the option of using other fully observed discrete variables for cross- or post-stratification. In this thesis, we start from the simple case-control study in which the data are collected under a two-phase response-selective sampling scheme. Special conditions are then considered to pose different questions of interest. In a secondary study, we may be interested in another binary response variable Y2, which is associated with the original case-control response Y in the data. Conventional logistic regression analysis can no longer provide consistent parameter estimates in this case. We also consider the situation in which the case-control status of each subject is actually defined by dichotomising a continuous variable which is potentially available in the population. Ignoring this source of information, as in binary logistic regression, typically results in a loss of efficiency which may be substantial. Linear regression analysis can be carried out to efficiently estimate the logistic model odds ratios and their 95% confidence intervals. The behaviour of various methods of analysis and sampling strategies for linear models is also discussed. We finally consider three-phase response-selective sampling designs, methods of analysis and some applications of three-phase methods. Our main approach for data analysis is semiparametric maximum likelihood. 
Survey-weighted methods, as well as other semiparametric approaches, are also considered for comparison. Their relative efficiencies and robustness properties are investigated using a wide range of simulation studies and real data analyses. The major conclusions for each type of analysis are as follows. When we have correlated bivariate outcomes in case-control studies, both the semiparametric maximum likelihood (SPML) method with joint modeling of bivariate responses, and the simple weighted method for the model of interest, can be used. The SPML method is often substantially more efficient than the weighted method, but is subject to bias in estimating parameters of interest when its nuisance models have been misspecified. The weighted method needs no nuisance models and thus is robust in this regard, but we cannot tell when it will be very inefficient without sophisticated modeling such as that of the SPML method. We suggest using both in this situation. Conclusions borne out by both types of analysis are particularly credible. For the type of case-control studies in which the binary response is potentially available as, or categorised from, a continuous variable, a linear regression analysis with logistic error distribution using this continuous variable as the response is suggested. The results can be easily converted into estimates of binary regression parameters with substantially higher efficiency. When such linear regression analyses are of particular interest, both the two-phase SPML method and the weighted method (with small modifications) are used to analyse data arising from various two-phase sampling designs. Introducing extra stratification in sampling as well as in the analysis can substantially increase the efficiency of both methods in estimating parameters of X strongly associated with a variable defining strata (say V). The SPML method is always efficient compared with the weighted method. 
Differences in efficiency are most noticeable under response-selective sampling, which is, however, precisely where the SPML method is least robust under error model misspecification. The weighted method is robust in this regard when a logistic error distribution is used, and also reasonably efficient under V-conditional sampling. We finally consider three-phase response-selective sampling and develop an appropriate three-phase SPML method to obtain efficient and consistent parameter estimates under arbitrary regression models. As a natural extension of Scott and Wild’s two-phase SPML method, this approach is always efficient compared with other semiparametric approaches. Its drawbacks are its complexity and its slow convergence in some situations. The method can be simplified, however, in case-control studies using logistic regression analysis. Although a general introduction to the missing data literature is given in the first chapter, in this thesis we deal only with some very structured problems which are versions of the multi-phase sampling problems discussed in Section 1.2. The survey of the broader field of missing data is given to put the work done here in its wider context. Happenstance missing values in the example data are deleted directly for simplicity. More sophisticated methods for missing data could be used but would require further methodological development and are therefore not considered in this thesis.