Ph.D. Theses

Bit-by-Bit: Indexing and Querying RDF data using Compressed Bit-Vectors

By Medha Atre
Advisor: James A. Hendler
August 19, 2011

The Resource Description Framework (RDF) is widely being adapted as a for information representation in various domains, e.g., biotechnology (UniProt), government data (the data.gov project), scholarly resources (DBLP), web resources (DBPedia), connection between people (FOAF) and many more. Due to the increasing use of RDF as a standard for data representation in the past few years, the amount of RDF data available on the web has increased at a break-neck speed. This has necessitated addressing two important issues -- efficient ways of (a) storing, and (b) querying the RDF data.

In this thesis, we propose a novel way of storing RDF data and an efficient way of processing SPARQL join queries -- BitMat -- which is specifically aimed at low-selectivity SPARQL join queries. BitMat uses the well-known technique of compressed bit-vectors to store RDF data by representing it as a 3-dimensional bit-cube (subject, predicate, object as each dimension of the bitcube). The key aspect of our join query processing is a novel 2-phase algorithm combining the idea of semi-joins and multi-way joins.

We have also extended BitMat's technique of /pruning/ the candidate RDF triples to process DISTINCT and OPTIONAL clauses in SPARQL. Specifically for the DISTINCT clause, BitMat's pruning phase can be used to omit BitMats that are not needed for final result generation. This further enhances the memory requirements of the query processor. We have also given a detailed description of how other SPARQL clauses can be handled -- either using BitMat algorithm, or by working on the results of BitMat algorithm.

RDF data is primarily graph data, hence we have also proposed to extend the technique of compressed bit-vectors used in BitMat to solve /label order constrained reachability/ queries on RDF graph. We propose a system -- BitPath -- which uses compressed bit-vectors to build specific indices for each node in the RDF graph and a query processing algorithm based on greedy pruning and divide-and-conquer approach to solve this problem.

Thus in this thesis we are addressing two important aspects of querying RDF data -- (a) performance intensive SPARQL queries, and (b) label-order-constrained-reachability queries. We show that using characteristics of compressed bit-vectors in both -- BitMat and BitPath algorithms -- we can come up with efficient solutions to both the problems.

Return to main PhD Theses page