dc.description.abstract |
Missing data is a common occurrence in most medical (or other) research
data collection processes. Missing data patterns can sometimes be caused by
design. In multi-phase response-selective sampling, the response variable Y
and some easily-obtainable variables are fully observed for a first-phase sample
or finite population. Some of the covariates of interest, X, which might
be difficult or too expensive to obtain, are then observed at later phases for
smaller subsamples. The selection of data at each phase is dependent on
the categorical information of Y, with the option of using other fully observed
discrete variables for cross- or post-stratification.
In this thesis, we start from the simple case-control study in which the data
are collected under a two-phase response-selective sampling scheme. Special
conditions are considered subsequently to pose different questions of interest.
In a secondary study, we may be interested in another binary response variable
Y2, which is associated with the original case-control response Y in the
data. Conventional logistic regression analysis can no longer provide consistent
parameter estimates in this case. We also consider the situation in which
the case-control status of each subject is actually defined by dichotomising a
continuous variable which is potentially available in the population. Ignoring
this source of information as in binary logistic regression typically results
in a loss of efficiency which may be substantial. Linear regression analysis
can be carried out to efficiently estimate the logistic model odds-ratios and
their 95% confidence intervals. The behaviour of various methods of analysis
and sampling strategies for linear models is also discussed. We finally consider
three-phase response-selective sampling designs, methods of analysis
and some applications of three-phase methods.
Our main approach for data analysis is semiparametric maximum likelihood.
Survey-weighted methods, as well as other semiparametric approaches, are
also considered for comparison. Their relative efficiencies and robustness
properties are investigated using a wide range of simulation studies and real
data analyses.
The major conclusions for each type of analysis are as follows. When we
have correlated bivariate outcomes in case-control studies, both the semiparametric
maximum likelihood (SPML) method with joint modeling of bivariate
responses, and the simple weighted method for the model of interest,
can be used. The SPML method is often substantially more efficient than
the weighted method, but is subject to bias in estimating parameters of interest
when its nuisance models have been misspecified. The weighted method
needs no nuisance models and thus is robust in this regard, but we cannot
tell when it will be very inefficient without sophisticated modeling such as
that of the SPML method. We suggest using both in this situation. Conclusions
borne out by both types of analyses are particularly credible.
For the type of case-control studies in which the binary response is potentially
available as, or categorised from, a continuous variable, a linear regression
analysis with logistic error distribution using this continuous variable as
the response is suggested. The results can be easily converted into estimates
of binary regression parameters with substantially higher efficiency.
When such linear regression analyses are of particular interest, both the two-phase
SPML method and the weighted method (with small modifications)
are used to analyse data arising from various two-phase sampling designs.
Introducing extra stratification in sampling as well as in the analysis can
substantially increase efficiencies of both methods in estimating parameters
of X strongly associated with a variable defining strata (say V). The SPML
method is always efficient compared with the weighted method. Differences
in efficiency are most noticeable under response-selective sampling, which
is, however, precisely where the SPML method is least robust under error
model misspecification. The weighted method is robust in this regard when
a logistic error distribution is used, and also reasonably efficient under V-conditional
sampling.
We finally consider three-phase response-selective sampling and have developed
an appropriate three-phase SPML method to obtain efficient and consistent
parameter estimates under arbitrary regression models. As a natural extension
to Scott and Wild’s two-phase SPML method, this approach is always
efficient compared with other semiparametric approaches. Its drawbacks are
its complexity and its slow convergence in some situations. The method can be simplified,
however, in case-control studies using logistic regression analysis.
Although a general introduction to the missing data literature is given in
the first chapter, in this thesis, we deal only with some very structured problems
which are versions of the multi-phase sampling problems discussed in
Section 1.2. The survey of the broader field of missing data is given to put
the work done here in its wider context. Happenstance missing values in
example data are directly deleted for simplicity. More sophisticated methods
for missing data could be used but would require further methodological
development and are therefore not considered in this thesis. |
en |