Multiple Imputation Through Statistical Learning

Deng, Yongshi

Degree Grantor

The University of Auckland

Abstract

Multiple imputation (MI) is becoming increasingly popular for handling missing data. However, some conventional MI methods have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on proper specification of the imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. This thesis presents an in-depth exploration of applying machine learning to tackle the challenge of missing data in large datasets. It focuses on developing automated and scalable MI procedures by leveraging advanced machine learning algorithms and incorporating statistical techniques to improve their imputation performance, while achieving computational efficiency. The key contributions of this thesis are three R packages: mixgb, miae and vismi. We propose mixgb, which is based on XGBoost and utilises subsampling and predictive mean matching (PMM). We demonstrate that mixgb has superior performance as an automated MI procedure, and that it is notably more time-efficient in imputing large datasets than other MI methods based on fully conditional specification (FCS). We present miae, an autoencoder-based MI implementation, which has a computational edge over FCS-based methods. Our evaluation reveals how various hyperparameter settings impact imputation performance and identifies specific challenges associated with applying autoencoders to MI. We propose several strategies to improve the imputation performance of miae, such as using random sampling as the initial imputation method and adjusting the variance of the weight initialisation for training stability.
We have also developed vismi, a visual diagnostic toolkit for MI, providing a wide range of functions for inspecting the distributional discrepancy between the observed and imputed data, as well as for overimputation diagnostics.
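
As a rough illustration of the predictive mean matching (PMM) step mentioned in the abstract, the following is a minimal Python sketch. It is not the mixgb implementation (mixgb is an R package, and its internals are not described here). The function name `pmm_impute`, its parameters, and the toy numbers are hypothetical. The sketch assumes that a fitted model, for example an XGBoost regressor, has already produced predictions for both the observed and the missing cases.

```python
import random


def pmm_impute(pred_obs, y_obs, pred_mis, k=5, rng=None):
    """Hypothetical helper illustrating predictive mean matching.

    pred_obs: model predictions for cases where the variable is observed
    y_obs:    the observed values for those same cases
    pred_mis: model predictions for cases where the variable is missing
    k:        number of candidate donors per missing case
    """
    rng = rng or random.Random()
    imputed = []
    for p in pred_mis:
        # Rank observed cases by how close their prediction is to p
        # and keep the k nearest as candidate donors.
        donors = sorted(range(len(y_obs)), key=lambda i: abs(pred_obs[i] - p))[:k]
        # Impute with the observed value of a randomly chosen donor,
        # so every imputed value is one that actually occurs in the data.
        imputed.append(y_obs[rng.choice(donors)])
    return imputed


# Toy data (assumed for illustration only).
pred_obs = [1.0, 2.1, 2.9, 4.2, 5.0]
y_obs = [1.2, 1.9, 3.3, 4.0, 5.1]
pred_mis = [2.0, 4.9]

print(pmm_impute(pred_obs, y_obs, pred_mis, k=2, rng=random.Random(0)))
```

Because the imputed values are drawn from observed donors rather than taken directly from the model predictions, repeating the draw with different random seeds yields different completed datasets, which is the variability that multiple imputation relies on.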