DOC PREVIEW
Duke CPS 296.1 - A Fast Index for Semistructured Data

This preview shows page 1-2-3 out of 10 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

A Fast Index for Semistructured DataBrian F. Cooper1,2, Neal Sample1,2, Michael J. Franklin1,3, Gísli R. Hjaltason1, Moshe Shadmon11RightOrder Incorporated3850 N. First St.San Jose, CA 95134 USA2Department of Computer ScienceStanford UniversityStanford, CA 94305 USA3Computer Science DivisionUniversity of CaliforniaBerkeley, CA 94720 USA{cooperb,nsample}@db.stanford.edu, [email protected],{gislih,moshes}@rightorder.comAbstractQueries navigate semistructured data via pathexpressions, and can be accelerated using anindex. Our solution encodes paths as strings, andinserts those strings into a special index that ishighly optimized for long and complex keys. Wedescribe the Index Fabric, an indexing structurethat provides the efficiency and flexibility weneed. We discuss how "raw paths" are used tooptimize ad hoc queries over semistructureddata, and how "refined paths" optimize specificaccess paths. Although we can use knowledgeabout the queries and structure of the data tocreate refined paths, no such knowledge isneeded for raw paths. A performance studyshows that our techniques, when implemented ontop of a commercial relational database system,outperform the more traditional approach ofusing the commercial system’s indexingmechanisms to query the XML.1. IntroductionDatabase management systems are increasingly beingcalled upon to manage semistructured data: data with anirregular or changing organization. An exampleapplication for such data is a business-to-business productcatalog, where data from multiple suppliers (each withtheir own schema) must be integrated so that buyers canquery it. Semistructured data is often represented as agraph, with a set of data elements connected by labeledrelationships, and this self-describing relationshipstructure takes the place of a schema in traditional,structured database systems. Evaluating queries oversemistructured data involves navigating paths through thisrelationship structure, examining both the data elementsand the self-describing element names along the paths.Typically, indexes are constructed for efficient access.One option for managing semistructured data is tostore and query it with a relational database. The datamust be converted into a set of tuples and stored in tables;for example, using tools provided with Oracle 8i/9i [25].This process requires a schema for the data. Moreover,the translation is not trivial, and it is difficult to efficientlyevaluate queries without extensions to the relationalmodel [26]. If no schema exists, the data can be stored asa set of data elements and parent-child nestingrelationships [17]. Querying this representation isexpensive, even with indexes. The STORED system [12]uses data mining to extract a partial schema. Data thatdoes not fit the schema well must be stored and queried inits native form.An alternative option is to build a specialized datamanager that contains a semistructured data repository atits core. Projects such as Lore [24] and industrial productssuch as Tamino [28] and XYZFind [29] take thisapproach. It is difficult to achieve high query performanceusing semistructured data repositories, since queries areagain answered by traversing many individual element toelement links, requiring multiple index lookups [23].Moreover, semistructured data management systems donot have the benefit of the extensive experience gainedwith relational systems over the past few decades.To solve this problem, we have developed a differentapproach that leverages existing relational databasetechnology but provides much better performance thanprevious approaches. Our method encodes paths in thedata as strings, and inserts these strings into an index thatis highly optimized for string searching. The index blocksand semistructured data are both stored in a conventionalrelational database system. Evaluating queries involvesencoding the desired path traversal as a search key string,and performing a lookup in our index to find the path.Permission to copy without fee all or part of this material is grantedprovided that the copies are not made or distributed for directcommercial advantage, the VLDB copyright notice and the title of thepublication and its date appear, and notice is given that copying is bypermission of the Very Large Data Base Endowment. To copyotherwise, or to republish, requires a fee and/or special permission fromthe EndowmentProceedings of the 27th VLDB Conference,Roma, Italy, 2001There are several advantages to this approach. First, thereis no need for aprioriknowledge of the schema of thedata, since the paths we encode are extracted from thedata itself. Second, our approach has high performanceeven when the structure of the data is changing, variableor irregular. Third, the same index can accelerate queriesalong many different, complex access paths. This isbecause our indexing mechanism scales gracefully withthe number of keys inserted, and is not affected by long orcomplex keys (representing long or complex paths).Our indexing mechanism, called the Index Fabric,utilizes the aggressive key compression inherent in aPatricia trie [21] to index a large number of strings in acompact and efficient structure. Moreover, the IndexFabric is inherently balanced, so that all accesses to theindex require the same small number of I/Os. As a result,we can index a large, complex, irregularly-structured,disk-resident semistructured data set while providingefficient navigation over paths in the data.We manage two types of paths for semistructureddata. First, we can index paths that exist in the raw data(called raw paths) to accelerate any ad hoc query. We canalso reorganize portions of the data, to create refinedpaths, in order to better optimize particular queries. Bothkinds of paths are encoded as strings and inserted into theIndex Fabric. Because the index grows so slowly as weadd new keys, we can create many refined paths and thusoptimize many access patterns, even complex patternsthat traditional techniques cannot easily handle. As aresult, we can answer general queries efficiently usingraw paths, even as we further optimize certain queriesusing refined paths. Maintaining all of the paths in thesame index structure reduces the resource contention thatoccurs with multiple indexes, and provides a uniformmechanism that can be tuned for different needs.Although our implementation of the Index Fabricuses a commercial relational DBMS, our techniques donot dictate a particular storage architecture. In fact,


View Full Document

Duke CPS 296.1 - A Fast Index for Semistructured Data

Documents in this Course
Lecture

Lecture

18 pages

Lecture

Lecture

6 pages

Lecture

Lecture

13 pages

Lecture

Lecture

5 pages

Load more
Download A Fast Index for Semistructured Data
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view A Fast Index for Semistructured Data and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view A Fast Index for Semistructured Data 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?