Efficient joins to process stream data

Reference

Degree Grantor

The University of Auckland

Abstract

Data integration used to be offline, but real-time data integration has become more and more important. Research into stream databases can be naturally applied to near-realtime data integration. Several important problems in near-real-time data integration can be naturally expressed as joins. Many stream joins assume all join inputs to be streams. Recently, interest has been growing in joins with heterogeneous input, in particular joins between streams and disk-based input. MESHJOIN is a well known algorithm published in this area. The algorithm was designed particularly for application scenarios where memory resources are limited. However, the algorithm suffers from some limitations. Briefly, the memory distribution among the join components and the strategy used for accessing the disk-based data are suboptimal. This thesis provides an independent analysis of the MESHJOIN algorithm. The focus of analysis is on equijoins as one of the most important special cases of joins. It has been shown that if a realistic distribution is assumed on stream data, such as a Zipfian distribution, MESHJOIN performs suboptimally. A set of algorithms have been developed that address the problems in MESHJOIN and they perform better than MESHJOIN in defined settings. In the end, three robust algorithms have been developed for both sorted and unsorted disk-based data. For these algorithms cost models have been developed for tuning the algorithms and validation of our implementation. An experimental study has been carried out for comparing these algorithms empirically. For that purpose a synthetic workload generator has been designed and developed. With the synthetic datasets, measurements have been taken in experiments that validate the cost models of the algorithms. The implemented algorithms are made available publicly as open source for independent analysis. In the future this research can be extended in two directions. One is to improve the join operators further. The other is to apply the join operators in emerging application scenarios.

Description

DOI

Keywords

ANZSRC 2020 Field of Research Codes

Collections