Abstract:
Outlier detection, also referred to as anomaly detection, is one of the most popular data mining
problems with many applications. With the increasing data volume, existing outlier detectors
are either too time-consuming in large-scale data sets or incapable of detecting certain types
of outliers. Ensemble techniques have become popular for unsupervised outlier detection due
to their efficiency and accuracy. However, most recent advanced ensemble algorithms lack the
capability of detecting local outliers.
This thesis first presents a generic framework, named LSH-LOCI, a novel local outlier ensemble
method based on locality-sensitive hashing forest (LSH Forest). LSH-LOCI builds
upon the LSH Forest data structure to efficiently find the local neighborhood areas to assess
local outliers and exploits the subsampling mechanism to run in linear time. Therefore,
LSH-LOCI does not require expensive pairwise distance computations and can capture the
data distribution’s diverse local densities. However, the time complexity of evaluating the `p
LSH family O(d), where d is the number of data dimensions. In order to cope with highdimensional
data more efficiently, we further instantiate our framework with RFiForest and
iQuadTree, which takes the random feature partitioning to built the isolation tree. RFiForest
and iQuadTree both run significantly faster than LSH-LOCI and achieve highly accurate
results. Our theoretical analysis shows that LSH-LOCI and RFiForest provide a tight bound
on density estimation by utilizing the distance preserving property of LSH. Empirically, our
framework outperforms the recent LSH iForest and conventional local outlier detectors. LSHLOCI
also provides competitive detection performance compared to recent ensemble outlier
methods on many real-world data sets.