Efficient joins to process stream data

dc.contributor.advisor Dobbie, G en
dc.contributor.advisor Weber, G en
dc.contributor.author Naeem, Muhammad en
dc.date.accessioned 2012-04-10T23:01:28Z en
dc.date.issued 2012 en
dc.identifier.uri http://hdl.handle.net/2292/16925 en
dc.description.abstract Data integration used to be an offline task, but real-time data integration has become increasingly important. Research into stream databases can be naturally applied to near-real-time data integration, where several important problems can be expressed as joins. Many stream joins assume that all join inputs are streams. Recently, interest has been growing in joins with heterogeneous inputs, in particular joins between streams and disk-based input. MESHJOIN is a well-known algorithm in this area, designed particularly for application scenarios where memory resources are limited. However, the algorithm has limitations: the memory distribution among the join components and the strategy used for accessing the disk-based data are suboptimal. This thesis provides an independent analysis of the MESHJOIN algorithm. The analysis focuses on equijoins, one of the most important special cases of joins. It is shown that if a realistic distribution is assumed for the stream data, such as a Zipfian distribution, MESHJOIN performs suboptimally. A set of algorithms has been developed that address the problems in MESHJOIN and perform better than MESHJOIN in defined settings. In the end, three robust algorithms have been developed for both sorted and unsorted disk-based data. Cost models have been developed for these algorithms, both for tuning them and for validating the implementation. An experimental study has been carried out to compare the algorithms empirically, and a synthetic workload generator has been designed and developed for that purpose. Measurements taken on the synthetic datasets validate the cost models of the algorithms. The implemented algorithms are publicly available as open source for independent analysis. In the future, this research can be extended in two directions: improving the join operators further, and applying them in emerging application scenarios. en
dc.publisher ResearchSpace@Auckland en
dc.relation.ispartof PhD Thesis - University of Auckland en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. en
dc.rights.uri https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm en
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ en
dc.title Efficient joins to process stream data en
dc.type Thesis en
thesis.degree.grantor The University of Auckland en
thesis.degree.level Doctoral en
thesis.degree.name PhD en
dc.rights.holder Copyright: The author en
pubs.author-url http://hdl.handle.net/2292/16925 en
pubs.elements-id 342100 en
pubs.record-created-at-source-date 2012-04-11 en
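
The abstract's remark about Zipfian stream data can be made concrete with a small sketch. The code below is not taken from the thesis and is not its workload generator; the function names, the key range, the exponent, and the seed are all assumptions chosen for illustration. It draws join keys with probability proportional to 1/rank and measures how much of the stream falls on the most frequent 1% of keys, which gives a feel for how skewed such a stream is.

```python
import random
from collections import Counter

def zipf_stream(num_keys, length, exponent=1.0, seed=0):
    """Draw `length` join keys from 1..num_keys with P(k) proportional to 1 / k**exponent.

    Hypothetical helper for illustration only; parameters are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same stream
    keys = range(1, num_keys + 1)
    weights = [1.0 / (k ** exponent) for k in keys]
    return rng.choices(keys, weights=weights, k=length)

def top_share(stream, fraction, num_keys):
    """Fraction of stream tuples whose key is among the most frequent `fraction` of keys."""
    counts = Counter(stream)
    top_n = max(1, int(num_keys * fraction))
    hits = sum(count for _, count in counts.most_common(top_n))
    return hits / len(stream)

stream = zipf_stream(num_keys=10_000, length=100_000)
share = top_share(stream, fraction=0.01, num_keys=10_000)
print(f"top 1% of keys cover {share:.0%} of stream tuples")
```

Under these assumed parameters a small set of keys accounts for a large part of the stream, whereas a uniform stream would put only about 1% of tuples on the same number of keys.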


Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by-nc-sa/3.0/nz/
