Abstract:
Real-time data streams allow decision making based on the latest information available. However, data streams often change over time. A good classifier should identify and adapt to this change. The classifier must also classify new instances efficiently enough to match the velocity of the stream. In meta-learning, we use additional data to inform our decisions on when and how to adapt our classification framework. This thesis demonstrates how meta-learning can be used to improve classification of instances in data streams. We do this by addressing three major challenges. The first challenge is to reuse previously trained classifiers to improve speed and accuracy in streams. We design and implement two fast, novel classification frameworks for this purpose. We show experimentally that our one algorithm has shorter runtime than a state-of-the-art classifier reuse framework and achieves similar accuracy. The other algorithm is as accurate as a leading state-of-the-art ensemble approach with shorter runtime. The second challenge is to recognise and use past metadata patterns to improve drift detection. We propose lift-per-drift, an evaluation metric for drift detection that does not rely on knowledge of where true drifts have occurred. We design a framework that can predict when concept drift is about to occur based on stream metadata, and experimentally show that this framework is able to either reduce delay and false positive drift detection or increase lift-per-drift and reduce false negative drift detection, based on a user’s requirements. The third challenge is to use metadata to improve classification of imbalanced data streams. Our framework for binary-class streams can identify examples of both the minority and majority class through dynamically adapting a classifier’s threshold based on recent performance. Our framework for multi-class streams uses a novel severity metric to identify and adapt to change at a class level. Our experiments show that these frameworks improve classification of minority classes in data streams. We conclude this work by demonstrating the value of our proposed techniques when applied to a real-world problem: understanding the likelihood of car crashes causing casualties in New Zealand.