Detecting and Quantifying Concept Drift for Data Stream

Degree Grantor

The University of Auckland

Abstract

Concept drift describes changes in the underlying distribution of streaming data. Concept drift research involves the development of methodologies and techniques for drift detection, understanding, and adaptation. Data analysis shows that if drift is not addressed, machine learning in a concept drift environment produces poor learning results. Most drift detection methods focus on supervised learning, but labels for streaming data are sometimes expensive to obtain. Most drift understanding methods quantify drift through the data distribution, which requires a certain amount of data. This thesis investigates two research streams: (1) an unsupervised drift detection method, which does not require prior knowledge of the data distribution, and (2) a framework that quantifies the severity of concept drift from the model perspective. In the first part, we focus on feature drift that shifts boundaries of mode and present an unsupervised framework to detect feature drift without labels. The framework detects abrupt and gradual feature drift using two distance functions, Wasserstein distance and Energy distance, and discusses feature changes in the data stream. A less explored area is describing the changes in the data stream. Crucially, the ability to describe these changes would enable a better understanding of how relationships in the data evolve over time. In particular, we seek to answer the following question: do distribution changes in important features also cause concept drift? Experimental results show that the proposed framework detects and describes feature drift. In the second part, we propose a framework to quantify the severity of concept drift from the model perspective. Our framework is based on the most popular data stream mining algorithm, the Hoeffding Tree. Our approach quantifies concept drift without access to the data, which reduces the probability of data leaks.
The severity of concept drift can be used as a guideline for choosing drift adaptation strategies. Our framework maps Hoeffding trees into groups of vectors and measures similarity and distance between vector groups. A larger similarity or smaller distance indicates that two trees are similar, and a smaller similarity or larger distance indicates that they are different.
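As a rough illustration of the first part, the sketch below compares one feature in a reference window and a current window using the two distances named in the abstract. It is not the thesis's framework: the function names, the equal-window-size restriction, the single shared threshold, and the synthetic Gaussian data are all assumptions made for this example. The energy distance here is the square root of 2E|X-Y| - E|X-X'| - E|Y-Y'|, computed over all sample pairs.

```python
import random
from itertools import product


def wasserstein_1d(ref, cur):
    """1-Wasserstein distance between two equal-sized 1-D samples.

    For equal-sized samples this is the mean absolute difference
    between the sorted values (assumption: windows have equal length).
    """
    if len(ref) != len(cur):
        raise ValueError("this sketch assumes equal window sizes")
    return sum(abs(a - b) for a, b in zip(sorted(ref), sorted(cur))) / len(ref)


def _mean_abs_diff(xs, ys):
    """Average |x - y| over every pair drawn from xs and ys."""
    return sum(abs(x - y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def energy_distance_1d(ref, cur):
    """Energy distance between two 1-D samples (square-root form)."""
    squared = (
        2 * _mean_abs_diff(ref, cur)
        - _mean_abs_diff(ref, ref)
        - _mean_abs_diff(cur, cur)
    )
    return max(squared, 0.0) ** 0.5  # guard against tiny negative rounding


def detect_feature_drift(ref, cur, threshold):
    """Flag drift when either distance exceeds a hypothetical threshold."""
    w = wasserstein_1d(ref, cur)
    e = energy_distance_1d(ref, cur)
    return {"wasserstein": w, "energy": e, "drift": w > threshold or e > threshold}


# Synthetic windows for one feature (illustrative data only).
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
same = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(200)]

no_drift = detect_feature_drift(reference, same, threshold=1.0)
drift = detect_feature_drift(reference, shifted, threshold=1.0)
```

No labels are used at any point, which matches the unsupervised setting the abstract describes. In practice the threshold would need to be calibrated per feature, and a gradual drift may need several consecutive windows before either distance crosses it.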

Description

Full Text is available to authenticated members of The University of Auckland only.
