Abstract:
Computing the functional dependencies that hold on a given data set is one
of the most important problems in data profiling. Our research advances the state
of the art in several ways. Utilizing new data structures and original techniques
for the dynamic computation of stripped partitions, we devise a new hybridization
strategy that outperforms the best existing algorithms in terms of efficiency, column
scalability, and row scalability. This is demonstrated on real-world benchmark data. We show that
current outputs contain many redundant functional dependencies, but canonical
covers greatly reduce output sizes. Smaller representations of outputs are easier
to comprehend and use. We propose the number of redundant data values as a
natural measure to rank the output of discovery algorithms. Our ranking assesses
the relevance of functional dependencies for the given data set.
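
To make the notions of stripped partitions and redundant data values concrete, the following minimal Python sketch may help. It is not the hybrid algorithm summarized above. The function names (`stripped_partition`, `fd_holds`, `redundant_values`) and the sample rows are hypothetical, and the redundancy count reflects one plausible reading: an occurrence of a right-hand-side value is counted as redundant when its tuple agrees with at least one other tuple on the left-hand side.

```python
from collections import defaultdict


def stripped_partition(rows, attrs):
    """Group row indices by their values on attrs and drop singleton groups."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(i)
    return [g for g in groups.values() if len(g) > 1]


def fd_holds(rows, lhs, rhs):
    """Check lhs -> rhs: rows that agree on lhs must also agree on rhs."""
    for group in stripped_partition(rows, lhs):
        first = rows[group[0]][rhs]
        if any(rows[i][rhs] != first for i in group[1:]):
            return False
    return True


def redundant_values(rows, lhs):
    """Assumed measure: count the rows whose lhs values also occur in another row.

    If lhs -> rhs holds, the rhs value in each such row is determined by a
    sibling row, so it is counted as one redundant occurrence.
    """
    return sum(len(g) for g in stripped_partition(rows, lhs))


# Hypothetical sample data, for illustration only.
rows = [
    {"zip": "0150", "city": "Oslo", "name": "Ann"},
    {"zip": "0150", "city": "Oslo", "name": "Bob"},
    {"zip": "5003", "city": "Bergen", "name": "Cy"},
]

zip_determines_city = fd_holds(rows, ["zip"], "city")
city_determines_name = fd_holds(rows, ["city"], "name")
redundancy = redundant_values(rows, ["zip"])
```

Under this reading, a dependency whose left-hand side repeats across many rows accounts for more redundant values and would rank higher than one that holds only because its left-hand side is nearly unique.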