Abstract:
As the amount of data grows, there is a strong pull from modern applications, and a push from query processing technologies, to manage the uncertainty in data. Indeed, veracity is one of the four ‘V’s by which big data is often characterized. Each application may have its own notion of imprecision. As part of the data integration process, uncertain data is harvested from multiple sources. Since the data is likely to satisfy different keys in different sources, it cannot be assumed that the integrated data set satisfies keys with full certainty. Nevertheless, keys are fundamental for any database system. Discovering keys from given data is therefore a core activity of data profiling. As veracity in data can be adequately embraced by probabilistic databases, it makes sense to profile the marginal probability by which keys hold in a given data set. That is, we want to compute, for each attribute set, the sum of the probabilities of those possible worlds in which the attribute set forms a key. Since the number of possible worlds is usually large, we need an efficient framework to profile such probabilistic keys. The basic research question we address in this dissertation is: “What is a good computational framework for profiling probabilistic keys from probabilistic databases?” Profiling probabilistic keys has not been studied previously. Our approach is to use the MapReduce framework to apply a traditional key mining algorithm independently to each possible world, and, for each attribute set, to sum up the probabilities of those worlds in which the attribute set is a key. The traditional key mining algorithm we use in this dissertation is based on hypergraph transversals. The first main contribution of this research is to define the problem of profiling probabilistic keys from probabilistic databases and to design a MapReduce algorithm that solves it. The second main contribution is to demonstrate the scalability of the proposed solution.
For that purpose, it will be demonstrated that the profiling time grows linearly in the number of possible worlds. However, the number of possible worlds in our experiments is much smaller than that typically expected for probabilistic databases. This limitation can be overcome by using distributed high-performance computing facilities with multiple computing nodes, each featuring several processors.
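The world-wise profiling idea described above can be sketched as follows. This is a minimal sequential illustration, not the dissertation's MapReduce implementation: it checks key satisfaction by brute-force duplicate detection rather than by hypergraph transversals, and the function and variable names are illustrative assumptions.

```python
from itertools import combinations

def satisfies_key(world, attrs):
    """True if no two tuples of `world` agree on all attributes in `attrs`,
    i.e. `attrs` forms a key of this possible world."""
    seen = set()
    for row in world:
        proj = tuple(row[a] for a in attrs)
        if proj in seen:
            return False
        seen.add(proj)
    return True

def profile_probabilistic_keys(worlds):
    """For each attribute set, sum the probabilities of the possible worlds
    in which it forms a key.  `worlds` is a list of (probability, relation)
    pairs, where a relation is a list of dicts over the same attributes."""
    attributes = sorted(worlds[0][1][0].keys())
    marginals = {}
    for r in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, r):
            marginals[attrs] = sum(
                p for p, world in worlds if satisfies_key(world, attrs)
            )
    return marginals
```

For example, with two possible worlds over attributes A and B, one with probability 0.6 in which B (but not A) is a key, and one with probability 0.4 in which A (but not B) is a key, the profiler reports marginal probabilities 0.4 for {A}, 0.6 for {B}, and 1.0 for {A, B}. In the MapReduce setting, the per-world key mining corresponds to the map phase, and the per-attribute-set summation of world probabilities corresponds to the reduce phase.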