U of M CSCI 8715 - Comparing path-based and vertically-partitioned RDF databases


Comparing path-based and vertically-partitioned RDF databases

Preetha Lakshmi & Chris Mueller
CSCI 8715
10/11/2007

Contents: Abstract; Overview & Objectives; Related Work; Importance and Relevance; Task List; Proposed Test Scenarios; Key Concepts; References

Abstract

Given the increasing prevalence of RDF data formats for storing and sharing data on the Semantic Web, efficient storage mechanisms for RDF data are becoming increasingly important. Two recent storage concepts open the door to significantly better query efficiency. The first, proposed by Matono et al. (2005), models RDF data as a graph, then stores materialized path expressions for efficient querying. The second, proposed by Abadi et al. (2007), stores RDF triples in a vertically-partitioned, column-oriented database in which each RDF property is stored in its own table. Our objective is to compare these two storage methods and identify their relative strengths and weaknesses across a wide range of possible queries.

Overview & Objectives

The Semantic Web initiative hopes to catalog the world's information by tagging data with relevant identifiers and descriptors. The aim is to transform the web from its current state, a cumbersome and irregular accumulation of human language, into a more cohesive, integrated whole with distinct names and categories for each piece of data, thus making the web machine-readable as well as human-readable. RDF (Resource Description Framework), a set of W3C specifications, has emerged as the de facto standard for sharing Semantic Web information.

RDF is composed of triples: <subject, predicate, object>. (Some sources use the term property in place of predicate; we use the two interchangeably.) For example, one triple might be <Pablo Picasso, painted, Guernica>. Another triple might state <Guernica, hasType, painting>, indicating that the particular item Guernica is one instance of a more general class of items called paintings.
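As a minimal illustration of the triple model described above, the two example triples can be held as plain tuples and queried by pattern. This is only a sketch: the names Triple and match are hypothetical helpers introduced here, and plain strings stand in for the URIs a real RDF store would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# The two example triples from the text; plain strings stand in for URIs.
triples = [
    Triple("Pablo Picasso", "painted", "Guernica"),
    Triple("Guernica", "hasType", "painting"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


# Everything Picasso painted, according to the stored triples.
painted_by_picasso = [
    t.object for t in match(triples, subject="Pablo Picasso", predicate="painted")
]
```

Scanning a flat list like this is the behavior of the simplest possible triple store; the storage schemes compared in this proposal exist precisely to avoid such full scans.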
RDF data can be thought of as a graph. In the example above, Picasso and Guernica are both nodes in the graph, connected by an edge labeled painted. We represent the node painting with a different color to indicate that it belongs to high-level schema information rather than instance-level data.

[Figure A: RDF data as a directed graph. Nodes: Picasso, Guernica, painting. Edges: Picasso -painted-> Guernica, Guernica -hasType-> painting.]

Recently, two novel schemes for storing RDF triples in databases have been published. One scheme, developed at MIT, uses vertical partitioning to store RDF triples; each table represents a predicate and stores subject-object pairs in two columns. Vertical partitioning has been shown to be especially efficient in the column-oriented database C-Store. The following tables might be used in the C-Store model.

painted
  Pablo Picasso                Guernica
  Pablo Picasso                Les Demoiselles d'Avignon

usesMedium
  Guernica                     oil paint
  Les Demoiselles d'Avignon    oil paint

hasType
  Guernica                     painting
  Les Demoiselles d'Avignon    painting
  Pablo Picasso                painter

Figure B: Table structure in vertically partitioned RDF data store (C-Store)

The other recently proposed model is path-based storage, in which RDF schema data, such as the generic triple <painters, paint, paintings>, is stored separately from RDF instance data. In addition, the possible paths between nodes in the RDF graph are stored, and nodes are related to these paths. For example, since Picasso has painted many paintings, all of them are related to the painter by the path '#painted'. If we extend the graph to include media as a property of a painting, a possible path expression between the painter and the medium is '#painted<#usesMedium'. Traversing all paths of this type from any painter delivers all media used by that artist.
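The two storage ideas described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of either published system: vertically_partition and materialize_paths are hypothetical names, the in-memory dictionaries stand in for database tables, and the path encoding simply joins predicates in traversal order with '<', following the '#painted<#usesMedium' example in the text.

```python
from collections import defaultdict

# Instance-level triples corresponding to the Figure B tables.
triples = [
    ("Pablo Picasso", "painted", "Guernica"),
    ("Pablo Picasso", "painted", "Les Demoiselles d'Avignon"),
    ("Guernica", "usesMedium", "oil paint"),
    ("Les Demoiselles d'Avignon", "usesMedium", "oil paint"),
    ("Guernica", "hasType", "painting"),
    ("Les Demoiselles d'Avignon", "hasType", "painting"),
    ("Pablo Picasso", "hasType", "painter"),
]


def vertically_partition(triples):
    """Build one two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def materialize_paths(tables, predicates, max_len=2):
    """Precompute (start_node, path_expression) -> set of reachable nodes.

    Only the listed predicates are followed, and only paths of at most
    max_len edges are stored (an assumed bound that also stops cycles).
    """
    edges = defaultdict(list)
    for p in predicates:
        for s, o in tables.get(p, []):
            edges[s].append((p, o))

    paths = defaultdict(set)
    # Each work item is (start node, current node, predicates followed so far).
    frontier = [(s, s, ()) for s in list(edges)]
    while frontier:
        start, node, preds = frontier.pop()
        if len(preds) == max_len:
            continue
        for p, nxt in edges.get(node, []):
            new_preds = preds + (p,)
            expression = "<".join("#" + name for name in new_preds)
            paths[(start, expression)].add(nxt)
            frontier.append((start, nxt, new_preds))
    return dict(paths)


tables = vertically_partition(triples)
paths = materialize_paths(tables, ["painted", "usesMedium"])

# With paths materialized, "all media used by Picasso" is a single lookup.
media = paths[("Pablo Picasso", "#painted<#usesMedium")]
```

The trade-off the sketch makes visible is the one this proposal sets out to measure: the path lookup is a single dictionary access, but adding a triple means recomputing affected paths, whereas the per-predicate tables accept a new row with no further work.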
The path-based model efficiently answers path-based queries, which would otherwise require extensive graph traversals.

These two proposals, vertical partitioning and path-based storage, represent the most advanced RDF storage mechanisms to date. However, the relative efficiencies of these schemes have not yet been compared. In both approaches, the RDF graph is the conceptual key to realizing the model. Both approaches use materialized storage of path expressions, i.e. precalculating inferred properties, to reduce processing time during queries.

Given the two RDF storage schemes described above, our objective is to compare their efficiency. We expect to find differences between the implementations when posing a variety of queries across differing datasets. We hypothesize that queries will perform significantly better under the path-based storage model on RDF data whose schema has significant depth. The path-based scheme directly captures the RDF schema, including class hierarchy and inheritance. Vertical partitioning does not capture this directly, so querying for schema information is less efficient. On the other hand, vertical partitioning performs well on queries over RDF data lacking hierarchy. An insert into the vertically-partitioned database is easier and faster than an insert into the path-based database; in the latter case, inserts might require changes to the schema and a fresh depth-first traversal of the RDF graph.

Related Work

The simplest way to store RDF data is in a triple store, essentially one large table with three columns for subject, predicate, and object. Variations on the triple store have improved efficiency and reduced the number of self-joins needed when issuing complex queries.

A normalized triple store attempts to improve the efficiency of the triple store. The basic table schema consists of a Statements table, which stores RDF triples in three columns, together with a Literals table and a Resources table.
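The Statements, Literals, and Resources layout can be sketched with Python's built-in sqlite3 module. The column names, the lookup_id and add_statement helpers, and the 'ex:' identifiers are all assumptions made for illustration, not the Jena schema. One deviation from the three-column description is labeled in the code: the object is split into two nullable columns so that a resource id can be told apart from a literal id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Resources (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
CREATE TABLE Literals  (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
-- Assumption: the object is split into two nullable columns so that a
-- resource id and a literal id cannot be confused.
CREATE TABLE Statements (
    subject         INTEGER NOT NULL REFERENCES Resources(id),
    predicate       INTEGER NOT NULL REFERENCES Resources(id),
    object_resource INTEGER REFERENCES Resources(id),
    object_literal  INTEGER REFERENCES Literals(id)
);
""")


def lookup_id(table, column, text):
    """Return the id of text in a lookup table, inserting it if missing."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (text,))
    row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?", (text,))
    return row.fetchone()[0]


def add_statement(subject, predicate, obj, literal=False):
    """Store one triple as ids pointing into Resources and Literals."""
    s = lookup_id("Resources", "uri", subject)
    p = lookup_id("Resources", "uri", predicate)
    if literal:
        o = lookup_id("Literals", "value", obj)
        conn.execute("INSERT INTO Statements VALUES (?, ?, NULL, ?)", (s, p, o))
    else:
        o = lookup_id("Resources", "uri", obj)
        conn.execute("INSERT INTO Statements VALUES (?, ?, ?, NULL)", (s, p, o))


# Hypothetical identifiers; the medium is treated as a literal for illustration.
add_statement("ex:Picasso", "ex:painted", "ex:Guernica")
add_statement("ex:Guernica", "ex:usesMedium", "oil paint", literal=True)

# Media used by Picasso: one self-join on Statements plus lookup joins.
rows = conn.execute("""
SELECT l.value
FROM Statements a
JOIN Statements b ON b.subject = a.object_resource
JOIN Resources rs ON rs.id = a.subject
JOIN Resources pa ON pa.id = a.predicate
JOIN Resources pb ON pb.id = b.predicate
JOIN Literals l   ON l.id = b.object_literal
WHERE rs.uri = 'ex:Picasso'
  AND pa.uri = 'ex:painted'
  AND pb.uri = 'ex:usesMedium'
""").fetchall()
media = [r[0] for r in rows]
```

Even this two-step question needs a self-join on Statements and four lookup joins, which illustrates why variations that cut down on joins are attractive.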
In RDF, literals refer to literal values, such as strings or integers, and resources refer to URIs or a. The Statements table contains references to items in the Literals and Resources tables, reducing disk space usage.

A variation on this approach, the denormalized triple store, attempts to limit the number of joins across the Statements, Resources, and Literals tables. Instead of always storing a reference into the Literals or Resources table, the Statements table holds the literal or resource itself as long as it is shorter than a certain limit, e.g. 255 characters. Jena1 used normalized triple stores, and Jena2 uses denormalized triple stores (Wilkinson 2003). Oracle also uses

