Abstract:
A data warehouse is a database, which is used for data analysis and reporting purposes. When a data warehouse is used in business applications, a requirement is that data is processed and refreshed in a timely fashion. Typically, data warehouses are updated in a batch fashion. But the delay in this off-line update can not meet the high demand of real-time updating. Active data warehousing is considered to be an alternative to the traditional data warehousing mechanism. It provides a better way to process real-time data in order to achieve a higher consistency between the stored information and the latest data updates. One of the key components of performing real-time data integration is the join between the incoming data stream and a disk relation. There is a heated discussion about the trade-off between the throughput and memory usage in streambased join algorithms. Many efforts have been made in the stream-based join research area to figure out the best practice. R-MESHJOIN is a novel and improved streambased join algorithm, which could tune the memory allocation of key components. On the one hand, R-MESHJOIN removes the dependency between the number of iterations to bring the disk relation into memory and the size of partitions in an internal queue of the stream data. On the other hand, the R-MESHJOIN algorithm is based on equi-join, which means the join performed is based on join attribute equality. In some scenarios, the stream-based join is not limited to equi-join. Similarity join (also known as non-equi join) is also required. Clearly, an improved algorithm is needed for handling similar scenarios. This thesis proposes a trie-based data structure for performing non-equi stream-based joins. To handle the similarity join scenario and store the matching attribute, we build an ordered tree data structure based on a trie. The join attribute is stored in the trie. The trie-based similarity join structure is integrated with R-MESHJOIN. To perform a non-equi join based on R-MESHJOIN, the algorithm traverses the trie to complete the fuzzy match with the join attribute and picks up the corresponding tuple as a result. A prototype of the trie-based data structure and integrated R-MESHJOIN algorithm have been implemented and the overheads of the algorithm are measured.