X-HYBRIDJOIN for near-real-time data warehousing

Naeem, MA; Dobbie, Gillian; Weber, Gerald

dc.contributor.author	Naeem, MA	en
dc.contributor.author	Dobbie, Gillian	en
dc.contributor.author	Weber, Gerald	en
dc.contributor.editor	Fernandes, AAA	en
dc.contributor.editor	Gray, AJG	en
dc.contributor.editor	Belhajjame, K	en
dc.coverage.spatial	Manchester, UK	en
dc.date.accessioned	2012-03-14T23:40:02Z	en
dc.date.issued	2011	en
dc.identifier.citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7051:33-47 2011	en
dc.identifier.isbn	978-3-642-24576-3	en
dc.identifier.issn	0302-9743	en
dc.identifier.uri	http://hdl.handle.net/2292/14402	en
dc.description.abstract	In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up-to-date with respect to end user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm Mesh Join (MESHJOIN) has been proposed to amortize disk access over a fast stream. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, e.g., certain products are sold more frequently than the remainder of the products. The question arises: how much does MESHJOIN lose in terms of performance by not adapting to data skew? In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against the performance of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be used by non-adaptive approaches such as MESHJOIN.	en
dc.publisher	Springer Verlag	en
dc.relation.ispartof	28th British National Conference on Databases	en
dc.relation.ispartofseries	Lecture Notes in Computer Science	en
dc.rights	Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. Details obtained from: http://www.sherpa.ac.uk/romeo/issn/0302-9743/	en
dc.rights.uri	https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm	en
dc.title	X-HYBRIDJOIN for near-real-time data warehousing	en
dc.type	Conference Item	en
dc.identifier.doi	10.1007/978-3-642-24577-0_5	en
pubs.begin-page	33	en
pubs.volume	7051	en
dc.rights.holder	Copyright: Springer Verlag	en
pubs.end-page	47	en
pubs.finish-date	2011-07-14	en
pubs.start-date	2011-07-12	en
dc.rights.accessrights	http://purl.org/eprint/accessRights/RestrictedAccess	en
pubs.subtype	Proceedings	en
pubs.elements-id	245466	en
dc.relation.isnodouble	12680	*
dc.relation.isnodouble	12681	*
pubs.org-id	Science	en
pubs.org-id	School of Computer Science	en
dc.identifier.eissn	1611-3349	en
pubs.record-created-at-source-date	2012-03-15	en