dc.contributor.author |
Wei, Z |
en |
dc.contributor.author |
Leck, U |
en |
dc.contributor.author |
Link, S |
en |
dc.date.accessioned |
2019-01-09T02:23:05Z |
en |
dc.date.available |
2019-01-09T02:23:05Z |
en |
dc.date.issued |
2018 |
en |
dc.identifier.citation |
CDMTCS Research Reports CDMTCS-524 (2018) |
en |
dc.identifier.issn |
1178-3540 |
en |
dc.identifier.uri |
http://hdl.handle.net/2292/45070 |
en |
dc.description.abstract |
Data profiling is an enabler for efficient data management and effective analytics. The discovery of data dependencies is at the core of data profiling. We conduct the first study on the discovery of embedded uniqueness constraints (eUCs), a recently introduced class of data dependencies that represent unique column combinations embedded in complete fragments of incomplete data. We show that the decision variant of finding a minimal eUC is NP-complete and W[2]-complete in the input size. We also characterize the maximum possible solution size, and show which families of eUCs attain that size. The size is much larger than for the special case of minimal SQL uniques. Despite these challenges, our column-efficient, row-efficient, and hybrid discovery algorithms perform effectively and fast on real-world benchmark and synthetic data. We also propose the computation of small semantic samples of given data sets as a new direction in data profiling. These samples satisfy the same eUCs as the given data set, and we showcase how discovery and sampling together provide a pathway towards effective data cleansing and business rule acquisition. |
en |
dc.publisher |
Department of Computer Science, The University of Auckland, New Zealand |
en |
dc.relation.ispartofseries |
CDMTCS Research Report Series |
en |
dc.rights |
Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. |
en |
dc.rights.uri |
https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm |
en |
dc.source.uri |
https://www.cs.auckland.ac.nz/research/groups/CDMTCS/researchreports/index.php |
en |
dc.title |
Discovery Algorithms for Embedded Uniqueness Constraints |
en |
dc.type |
Technical Report |
en |
dc.subject.marsden |
Fields of Research |
en |
dc.rights.holder |
Copyright: The authors |
en |
dc.rights.accessrights |
http://purl.org/eprint/accessRights/OpenAccess |
en |