Abstract:
Embedded functional dependencies (eFDs) were recently introduced to tailor
relational schema design to data completeness requirements of applications. They
also facilitate data cleaning and data integration. A problem that is essential to
unlocking these applications is the discovery of all eFDs that hold on a given data
set. We show that the discovery problem of eFDs is NP-complete, W[2]-complete
in the output, and has a minimum solution space that is larger than the maximum
solution space for functional dependencies. Despite these computational challenges,
we use novel data structures and search strategies to develop row-efficient, columnefficient, and hybrid algorithms that can efficiently solve the discovery problem for
eFDs on large real-world benchmark data sets. Our experiments also demonstrate
that the algorithms scale well in terms of their design targets, and that ranking
the eFDs by the number of redundant data values they cause can provide useful
guidance in identifying meaningful eFDs for applications. Finally, we demonstrate
the benefits of introducing completeness requirements and ranking by the number
of redundant data values for approximate and genuine functional dependencies.