Abstract:
SPARQL Protocol and RDF Query Language (SPARQL) has emerged as the W3C-recommended language for querying data stored in the Resource Description Framework (RDF) format. The flexible pattern-matching capabilities of SPARQL entail performance challenges for complex queries. Query processing in triplestores (databases specializing in the storage of RDF) encounters all the same problems as query processing in relational databases, plus additional problems due to the schema-free nature of RDF. Modern query optimizers are complex and use heuristics as well as statistics. An exhaustive approach to statistics generation and storage (covering the entire database as well as all combinations of triple components) produces a significant overhead. Currently, there is no pure online cost-based optimizer for triplestores. Online optimizers are capable of reducing the overhead involved in creating and managing statistics. In this thesis we explore the hypothesis that storing only selectivity statistics for predicates enables effective optimization of typical benchmark queries. Based on this hypothesis, we introduce a pure online optimizer for triplestores, the Learning Statistics Optimizer (LSO), which learns from query executions. LSO has a low performance overhead and low memory consumption for storing the statistics, and it provides an easily extensible storage architecture for statistics. Online optimization is advantageous because it ensures that work is done only when it is needed. We implemented LSO in a main-memory triplestore, PDStore. We evaluated the performance of LSO with the Berlin SPARQL Benchmark (BSBM) queries on the 250k-triple BSBM dataset. The results show that the optimizer's runtime overhead is very small, that it continues to generate good query execution plans, and that it is highly competitive with Sesame, Virtuoso, D2R and Jena TDB. PDStore is also benchmarked using a custom-built read-write benchmark called the Flight Booking Simulation Benchmark (FBSB).