Scalable Reduction of Large Datasets to Interesting Subsets

Gregory Todd Williams, Jesse Weaver, Medha Atre, and James A. Hendler
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA

Winner of the 2009 Billion Triples Challenge

Accepting BTC Award
Jesse's BTC Presentation

Abstract

With a huge amount of RDF data available on the web, the ability to find and access relevant information is crucial. Traditional approaches to storing, querying, and inferencing fall short when faced with web-scale data. We present a system that combines the computational power of large clusters, used for large-scale inferencing and data access, with an efficient data structure for storing and querying the extracted data on a traditional personal computer or smaller embedded device. We present the results of using this system to load the Billion Triples Challenge dataset, fully materialize RDFS inferences, and extract an "interesting" subset of the data on a large cluster, and then to analyze the extracted data further on a traditional personal computer.

Illustration of System
Inferencing            349 sec
Extracting/Reducing    940 sec
BitMat Creation         25 sec
TOTAL                1,314 sec (~22 min)

Documents:

Upper Ontology (31 triples):

BTC2009-related Ontologies (13,599 triples):

Statistics on RDFS Closure of BTC2009 Dataset:

(Note that the closure was computed with the exclusion of rules lg, gl, rdfs1, rdfs4a, rdfs4b, rdfs6, rdfs8, and rdfs10, and that it also excludes extensions of RDF/S terms like [my:subclass rdfs:subPropertyOf rdfs:subClassOf].)
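
To make the note above concrete, here is a minimal sketch of a forward-chaining RDFS closure restricted to a subset of the rules that the note does not exclude (rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, and rdfs11). It is an illustration only, not the cluster implementation used for the results on this page: the function name `rdfs_closure`, the prefixed-string encoding of terms, the quote-prefix test for literals, and the `ex:` example data are all assumptions made for the example. Because it iterates naively to a fixpoint, it would also follow extensions of RDF/S terms such as [my:subclass rdfs:subPropertyOf rdfs:subClassOf]; it does not attempt to reproduce that exclusion.

```python
from collections import defaultdict

# Terms are plain prefixed strings here (an assumption for the sketch).
RDF_TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBPROP = "rdfs:subPropertyOf"
SUBCLASS = "rdfs:subClassOf"


def _index(triples, predicate):
    """Map subject -> set of objects for all triples using `predicate`."""
    idx = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            idx[s].add(o)
    return idx


def _is_literal(term):
    # Assumption: literals are encoded as strings starting with a double quote.
    return term.startswith('"')


def rdfs_closure(triples):
    """Apply rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, rdfs11 until nothing new appears."""
    closure = set(triples)
    while True:
        domain = _index(closure, DOMAIN)
        range_ = _index(closure, RANGE)
        subprop = _index(closure, SUBPROP)
        subclass = _index(closure, SUBCLASS)
        derived = set()
        for s, p, o in closure:
            for c in domain.get(p, ()):        # rdfs2: domain typing
                derived.add((s, RDF_TYPE, c))
            if not _is_literal(o):
                for c in range_.get(p, ()):    # rdfs3: range typing
                    derived.add((o, RDF_TYPE, c))
            for q in subprop.get(p, ()):       # rdfs7: subproperty inheritance
                derived.add((s, q, o))
            if p == SUBPROP:
                for r in subprop.get(o, ()):   # rdfs5: subPropertyOf transitivity
                    derived.add((s, SUBPROP, r))
            elif p == SUBCLASS:
                for d in subclass.get(o, ()):  # rdfs11: subClassOf transitivity
                    derived.add((s, SUBCLASS, d))
            elif p == RDF_TYPE:
                for d in subclass.get(o, ()):  # rdfs9: subclass typing
                    derived.add((s, RDF_TYPE, d))
        if derived <= closure:                 # fixpoint reached
            return closure
        closure |= derived


# Hypothetical input data for illustration.
EXAMPLE = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:hasPet", DOMAIN, "ex:Person"),
    ("ex:hasPet", RANGE, "ex:Animal"),
    ("ex:hasDog", SUBPROP, "ex:hasPet"),
    ("ex:alice", "ex:hasDog", "ex:rex"),
    ("ex:rex", RDF_TYPE, "ex:Dog"),
}
```

Calling `rdfs_closure(EXAMPLE)` returns the input triples plus the derived ones, for instance `("ex:alice", "ex:hasPet", "ex:rex")` via rdfs7 and `("ex:alice", "rdf:type", "ex:Person")` via rdfs2. Rebuilding the indexes on every pass keeps the sketch short; a web-scale closure would need a very different, parallel strategy.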

Reduced Dataset (784,783 triples):

Statistics about Reduced Dataset:
