Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Comparing path-based and vertically-partitioned RDF databasesPreetha Lakshmi & Chris Mueller12/10/2007CSCI 8715Shashi ShekharOutlineMotivationBackground and related workProblem statementOur contributionsAssumptionsExperimental processResultsConclusionsMotivationSemantic Weblibrariesscientific databasesindustrysocial networksComputer-to-computer communicationRDF SchemaSchemaInstanceRDF SchemaRDF Triples<subject, property, object><“www.picasso.net” , first, “Pablo”>Related WorkTriple storeProperty tablesClass property tablesDynamic table model Vertically partitioned tables (Abadi, et al 2007)Path based approach (Matono, et al 2005)Require more self joins, normal joins, NULL value storageVertical PartitioningA table is created for each propertyFirst Subject Object 'r1' 'Picasso''r4' 'August'Last Subject Object 'r1' 'Picasso''r4' 'Rodin'Paints Subject Object 'r1' 'r2''r1' 'r3'... etc.Path-based ModelPath signatures relate to instance dataPath pathid pathexp 1 ''2 '#first'3 '#last'4 '#paints'5 '#title<#paints'6 '#sculpts'7 '#title<#sculpts'Resource name pathid root 'r1' 1 'r1''r2' 4 'r1''r3' 4 'r1''r4' 1 'r4''Picasso' 2 'r1''Pablo' 3 'r1''August' 2 'r4''Rodin' 3 'r4'...Our enhancementProblem StatementGiven: A set of RDF triplesVertical partitioning storage modelPath-based storage model Find: Query plans for the various categories of queries under these two storage schemes. Ob jective: To determine query types that perform comparatively better or worse in two storage models Why is this challenging?Need for efficient storage of structured dataDifferent application domains use RDF, generic storage schemes should support a diverse workload.ContributionsIdentification of benchmark queriesschema, instance, path, and aggregate queries Enhancement to the path-based schema that addresses different types of workloads Comparison of path-based model and vertical partitioning Analysis of cyclic queriesQuery TypesSchema queriesfind all types of artistslist all property nameslist nodes with 2 or more descendants.find the transitive sub-classes of a class 'sculpture'list properties with 2 or more descendantsInstance queriesfind the titles of all paintings by Picassoselect all nodes within one edge-length of R4list all the properties of node r4Schema vs InstancePathNon-pathAggregateCycleRelationshipDiameter Constraintsintermediate nodeterminal nodeConnectionListQuery TypesPath queriesfind the title of any painting painted by anyonedisplay all the titles of work done by artistsfind the names of all the sculptors...with constraint on intermediate nodefind an artist's name where the artifact is a painting...with terminal node constraintsdisplay all the titles of work done by PicassoQuery TypesPath queriesconnection querieslist all the properties of node r4is there a connection between 'Picasso' and 'Guernica'?diameter queriesselect all nodes in the graph within one edge-length of R4non-simple path queriesdetect loops in the dataset starting at 'Picasso'detect loops in the whole datasetQuery TypesAggregate queriesfind all nodes with 2 or more propertieslist all subjects that have two instances of a single propertyRelationship queriesfind any relationship between r1 and r4AssumptionsUsing a small dataset, with the assumption that number of joins and efficiency of the queries will not change significantly with larger datasetsNo explicit storage of the RDF schema in the vertically-partitioned scheme (application independent)INSERT, UPDATE, & DELETE are insignificant compared to SELECTKey nodes in the path-based model are well-definedIn practice, key nodes, would be generated dynamically after user load analysisExperimental ProcessValidation parametersNodesEdgesNumber of joinsNumber of tablesCPU costStorage bytesSetup both schemes in Oracle 10g for the RDF graph shown earlierMaterialized path lengths in path-based schemeGenerated query plansAnalyzed queries based on the validation parametersCycle queries – joins are not supportedDataset used for experiment* For CPU cost and bytes (storage) the entry in the table indicates which scheme used less CPU cycles or occupied less space. In cases where both required an identical or similar amount of computation or storage, we indicate this with “same”.Queries which cannot be answered are indicated by ‘--‘.Experimental ResultsConclusions & ObservationsVertical Partitioning performs well for Short path length, terminal node constraints.Offers storage benefits for instance queries without path expressions.Enhanced Path Based model performs well forSchema queries, path queries, cycle queriesQueries which the original path-based could not address and the enhanced model could answer:Connection queries and diameter queriesPath queries with intermediate node constraintsConclusion (Cont'd)Both the schemes show the same performance on instance queries without path expressions.Both the schemes do not address relationship queriesInteresting results for cycle queriesspecifying the start node gives a bad performance than when the start node is not specifiedspecifying the start node uses Oracle Filter.Future WorkTest large and diverse datasetsTest vertical partitioning with a column-orientated database like MonetDBPruning strategies for cycle queriesImpose join indexesFind approaches to answer relationship queriesStorage classification based on the application domainThank YouQuestions?Please see http://www.cs.umn.edu/~cmueller/cs8715 for a copy of the report that accompanies this presentation, including a full
View Full Document