Abstract:
Missing data may arise due to happenstance, when some units fail to respond, or due to design, such as in multiphase sampling schemes. The goal is to model a response Y in terms of a set of covariates X. Efficient analysis for multiphase studies can be obtained by maximizing the resulting (full) likelihood. However, if the final (and fully observed) sample is outcome-dependent, the resulting likelihood involves the marginal distribution of an often high-dimensional X. Modelling this marginal distribution may be hard or even infeasible, and methods that treat it non-parametrically are of interest. Semiparametric methods in which only the conditional distribution of Y given X is treated parametrically have been widely discussed in the literature. Most methods, however, estimate the probability of providing full information through a saturated model, which may only be possible in specific scenarios. Moreover, fully efficient methods often do not take into account extra information that is not part of the model of interest. Such data are discarded, and approaches that make better use of all the available information are needed. Here we present a semiparametric method, denoted by CML+eS, that addresses both issues. We first showed that it is consistent and asymptotically normal under mild conditions and later performed extensive simulation studies, for both discrete and continuous responses. Our simulations showed substantial gains in efficiency when extra variables, not used for selecting the data, were taken into account. The method was later extended to a wider class of designs, which encompasses the well-known case-control study and many others. It was shown to be consistent and more efficient than the commonly used weighted approach in all cases analysed, but not as robust to model misspecification. The method is strongly connected to propensity scores, and their similarities and differences were also discussed.
Both approaches were later combined, providing an alternative method for estimating treatment effects that could be applied in outcome-dependent problems. Finally, we discussed the asymptotic efficiency of the CML+eS method by numerically deriving the semiparametric efficiency bound. The proposed estimator seemed to achieve this bound in some specific scenarios. For a discrete response the equality is mathematically guaranteed, and the CML+eS method is thus fully semiparametric efficient.