Abstract:
Machine learning should support decision-making by drawing purely logical conclusions from historical data. If this data is biased, however, that bias is transferred to the model and remains undetected, because performance is validated on a test set drawn from the same biased distribution. Existing strategies for bias identification and mitigation generally rely on some prior knowledge of the bias or of the ground truth. This reliance is problematic, particularly if the user is not aware of the bias, no ground-truth knowledge is available, or no concrete target task has been defined yet, e.g., during data gathering.
We argue that some indication of future problems is already present in the historical dataset itself. Extracting it as early as during data gathering can help correct flaws on the fly or raise awareness among researchers working with the dataset.
In this thesis, we aim to identify selection biases from the historical data alone, when no ground-truth information is available. Selection biases stem from a non-uniform sampling process. To mitigate them, we generate additional data points that bridge the gap between the sample distribution and the ground-truth distribution. Pioneering this research topic, we propose three algorithms built on the assumption that the distribution of a sufficiently large and unbiased dataset should be smooth, without any sudden drops in density.
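To make the smoothness assumption concrete, the following toy sketch (not one of the thesis's three algorithms; the data, the threshold `density_threshold`, and all other choices are illustrative assumptions) estimates the density of a one-dimensional sample with a kernel density estimate, flags a region where the density drops sharply, and fills it with uniformly drawn bridging points:

```python
# Illustrative sketch of the smoothness assumption only; not the thesis's method.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# A biased 1-D sample: two clusters with the region between them undersampled,
# as a non-uniform selection process might produce.
sample = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

# Estimate the sample density; a sufficiently large, unbiased sample should
# yield a smooth curve without abrupt drops.
kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 400)
density = kde(grid)

# Flag grid points whose density falls far below the surrounding level --
# a crude proxy for a "sudden drop" hinting at a selection bias.
density_threshold = 0.25 * density.max()  # hypothetical cutoff
gap = grid[density < density_threshold]

# Bridge the gap with synthetic points drawn uniformly over the flagged
# region, nudging the empirical distribution toward the assumed smooth one.
if gap.size:
    bridging_points = rng.uniform(gap.min(), gap.max(), size=100)
    augmented = np.concatenate([sample, bridging_points])
```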
Extensive experiments and discussions highlight the need for such data-analysis tools and illustrate that each of our methods has its own merits. Overall, we contribute to a better understanding of the data we use and trust, and we challenge existing procedures in machine learning that accept flawed data as given and treat symptoms rather than causes.