Abstract:
Machine learning should support decision-making by drawing purely logical conclusions from historical data. If this data is biased, however, that bias is transferred to the model and remains undetected, because performance is validated on a test set drawn from the same biased distribution. Existing strategies for bias identification and mitigation generally rely on some prior knowledge of the bias or of the ground truth. This reliance is problematic, particularly if the user is not aware of the bias, no ground-truth knowledge is available, or no concrete target task has been defined yet, e.g., during data gathering.
We argue that some indication of future problems is already present in the historical dataset itself. Extracting it as early as during data gathering can help correct flaws on the fly or raise awareness among researchers working with the dataset.
In this thesis, we aim to identify selection biases from the historical data alone, when no ground-truth information is available. Selection biases stem from a non-uniform sampling process. To mitigate them, we generate additional data points that bridge the gap between the sample distribution and the ground-truth distribution. Pioneering this research topic, we propose three algorithms built on the assumption that the distribution of a sufficiently large and unbiased dataset should be smooth, without any sudden drops in density.
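To make the smoothness assumption concrete, the following toy sketch (not one of the thesis's three algorithms; the data, the threshold `density_threshold`, and all other choices are illustrative assumptions) estimates the density of a one-dimensional sample with a kernel density estimate, flags a region where the density drops sharply, and fills it with uniformly drawn bridging points:

```python
# Illustrative sketch of the smoothness assumption only; not the thesis's method.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# A biased 1-D sample: two clusters with the region between them undersampled,
# as a non-uniform selection process might produce.
sample = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

# Estimate the sample density; a sufficiently large, unbiased sample should
# yield a smooth curve without abrupt drops.
kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 400)
density = kde(grid)

# Flag grid points whose density falls far below the surrounding level --
# a crude proxy for a "sudden drop" hinting at a selection bias.
density_threshold = 0.25 * density.max()  # hypothetical cutoff
gap = grid[density < density_threshold]

# Bridge the gap with synthetic points drawn uniformly over the flagged
# region, nudging the empirical distribution toward the assumed smooth one.
if gap.size:
    bridging_points = rng.uniform(gap.min(), gap.max(), size=100)
    augmented = np.concatenate([sample, bridging_points])
```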
Extensive experiments and discussions highlight the need for such data-analysis tools and illustrate that each of our methods has its own merits. Overall, we contribute to a better understanding of the data we use and trust, and we challenge existing procedures in machine learning that accept flawed data as given and treat symptoms rather than causes.