Improving the Performance of Near Real-Time Data Warehouses

Wang, Zhe

Improving the Performance of Near Real-Time Data Warehouses

Wang, Zhe

Identifier: http://hdl.handle.net/2292/25293

Issue Date: 2014

Reference: 2014

Degree Grantor: The University of Auckland

Rights: Copyright: The Author

Rights (URI): https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm

Abstract:

A data warehouse is a database, which is used for data analysis and reporting purposes. When a data warehouse is used in business applications, a requirement is that data is processed and refreshed in a timely fashion. Typically, data warehouses are updated in a batch fashion. But the delay in this off-line update can not meet the high demand of real-time updating. Active data warehousing is considered to be an alternative to the traditional data warehousing mechanism. It provides a better way to process real-time data in order to achieve a higher consistency between the stored information and the latest data updates. One of the key components of performing real-time data integration is the join between the incoming data stream and a disk relation. There is a heated discussion about the trade-off between the throughput and memory usage in streambased join algorithms. Many efforts have been made in the stream-based join research area to figure out the best practice. R-MESHJOIN is a novel and improved streambased join algorithm, which could tune the memory allocation of key components. On the one hand, R-MESHJOIN removes the dependency between the number of iterations to bring the disk relation into memory and the size of partitions in an internal queue of the stream data. On the other hand, the R-MESHJOIN algorithm is based on equi-join, which means the join performed is based on join attribute equality. In some scenarios, the stream-based join is not limited to equi-join. Similarity join (also known as non-equi join) is also required. Clearly, an improved algorithm is needed for handling similar scenarios. This thesis proposes a trie-based data structure for performing non-equi stream-based joins. To handle the similarity join scenario and store the matching attribute, we build an ordered tree data structure based on a trie. The join attribute is stored in the trie. The trie-based similarity join structure is integrated with R-MESHJOIN. To perform a non-equi join based on R-MESHJOIN, the algorithm traverses the trie to complete the fuzzy match with the join attribute and picks up the corresponding tuple as a result. A prototype of the trie-based data structure and integrated R-MESHJOIN algorithm have been implemented and the overheads of the algorithm are measured.

Description:

Full text is available to authenticated members of The University of Auckland only.

Show full item record

Files in this item

Name: whole.pdf

Size: 1.436Mb

Format: PDF

This item appears in the following Collection(s)

Masters Theses - Authenticated Access [6749]

Improving the Performance of Near Real-Time Data Warehouses

Improving the Performance of Near Real-Time Data Warehouses

Abstract:

Description:

Files in this item

This item appears in the following Collection(s)

Search ResearchSpace

Browse

All of ResearchSpace

This Collection

Statistics

Improving the Performance of Near Real-Time Data Warehouses

Improving the Performance of Near Real-Time Data Warehouses

Abstract:

Description:

Files in this item

This item appears in the following Collection(s)

Share

Search ResearchSpace

Browse

All of ResearchSpace

This Collection

Statistics