Abstract:
The goal of data integrity is to maintain and ensure the accuracy and consistency of data over its life cycle. It is a critical aspect of the design, implementation, and usage of any system that stores, processes, or retrieves data. Edgar Codd, the Turing Award-winning inventor of the relational model of data, proposed domain, entity, and referential integrity as the three most basic classes of constraints that a database system enforces to help achieve data integrity. In particular, referential integrity ensures that data in one part of the database correctly references data in another part of the database. Despite their fundamental importance to data processing, referential integrity constraints have not been well studied for real-world databases that conform to the SQL industry standard. In such databases, incomplete information is represented by occurrences of null markers. The SQL standard defines three semantics that can be applied when referential integrity is verified for SQL data: full, simple, and partial semantics. Surprisingly, commercial and open-source database management systems only offer built-in support for simple semantics. This gap between the SQL standard and its implementations can have a large negative impact on the accuracy and consistency of real-world data, and can therefore severely limit the insight gained from data analysis and the profit from data-driven decision making. This thesis addresses this long-standing gap between the industry standard and its implementations in actual database systems. My research develops techniques that aim to make the partial semantics of referential integrity constraints a first-class citizen of database systems. In particular, first techniques are developed that allow database systems to i) discover, ii) enforce, and iii) validate partial referential integrity constraints efficiently.
Furthermore, several applications are identified that highlight the benefits of partial referential integrity for core data processing tasks such as queries, updates, and repairs. My findings can be used to improve daily database practice.
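
As a minimal illustration of how the three semantics named above differ, the following Python sketch checks a single referencing row against a set of referenced rows. It follows the usual reading of the SQL standard's MATCH SIMPLE, MATCH PARTIAL, and MATCH FULL options; the function name `satisfies`, the tuple representation (with `None` standing in for a null marker), and the sample data are assumptions made for this example only and are not taken from the thesis. Referenced rows are assumed to contain no null markers.

```python
from typing import Optional, Sequence, Tuple

# A row is a tuple of column values; None stands in for an SQL null marker.
Row = Tuple[Optional[str], ...]


def satisfies(child: Row, parents: Sequence[Row], match: str) -> bool:
    """Return True if the referencing row `child` satisfies the foreign key
    against the referenced rows `parents` under the given match semantics
    ("simple", "partial", or "full"). Hypothetical helper for illustration."""
    if match not in ("simple", "partial", "full"):
        raise ValueError(f"unknown match semantics: {match}")

    # Positions of the foreign key columns that carry an actual value.
    non_null = [i for i, value in enumerate(child) if value is not None]

    # A row whose foreign key columns are all null satisfies every semantics.
    if not non_null:
        return True

    some_null = len(non_null) < len(child)
    if some_null and match == "simple":
        return True   # any null marker exempts the row from checking
    if some_null and match == "full":
        return False  # a mix of null and non-null values is rejected

    # Remaining cases: no null markers at all (every semantics demands an
    # exact match), or partial semantics with some null markers (some
    # referenced row must agree on every non-null column).
    return any(all(p[i] == child[i] for i in non_null) for p in parents)


# Hypothetical referenced rows: (country, city).
parents = [("NZ", "Auckland"), ("AU", "Sydney")]

for child in [("NZ", None), ("FR", None)]:
    results = {m: satisfies(child, parents, m) for m in ("simple", "partial", "full")}
    print(child, results)
```

In this sketch the row `("FR", None)` is accepted under simple semantics even though no referenced row has country `FR`, whereas partial semantics rejects it and full semantics rejects any row that mixes null and non-null values.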