Stanford CS 276 - Study Notes - D2720154

Home> Schools> Stanford University> Computer Science (CS) > CS 276> Study Notes

DOC PREVIEW

Stanford CS 276 - Study Notes

School name Stanford University

Course Cs 276- Information Retrieval and Web Search

Pages 11

This preview shows page 1-2-3-4 out of 11 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

ABSTRACTINTRODUCTIONRELATED WORKPRELIMINARIESGeneral Implementation InformationBenchmarkTURBOXPATH ANALYSISAlgorithm correctionsClosing child elementsRecursive documentsArray boundsCorrected versionEvaluation extensionMultiple expressionsCOMPARATIVE ANALYSISFunctionalityPredicate evaluationPerformanceSpeedMemory usageDocument sizeDocument depthQuery types5.2.4.1. Descendant axis5.2.4.2. ‘Any’ node test5.2.4.3. Predicate evaluationFUTURE DIRECTIONSBackward axis processingFunction supportExpression typesPredicate pipeliningCONCLUSIONREFERENCESAPPENDIX 1: DEMOUsageFilesbinsrcAPPENDIX 2: XPATHMARK QUERIESCS276B Project Report: Streaming XPath Engine Amruta Joshi Oleg Slezberg ABSTRACT XPath [1] has received a lot of attention in research during the last five years. It is widely used both as a standalone language and as a core component of XQuery [2] and XSLT [3]. Our project (titled xstream) concentrated on evaluation of XPath over XML streams. This research area contains multiple challenges resulting from both the richness of the language and the requirement of having only a single pass over the data. We modified and extended one of the known algorithms, TurboXPath [4], a tree-based IBM algorithm. We also provide extensive comparative analysis between TurboXPath and XSQ [5], currently the most advanced of finite automata (FA)-based algorithms. 1. INTRODUCTION Querying XML streams has a variety of applications. In some, data occurs naturally in the streaming form (e.g. stock quotes, news feeds, XML routing). In others, streaming is beneficial for the performance reasons, since the data is accessed using a single sequential scan. Whatever the reason, evaluation of XML streams using XPath poses many challenges. First, the evaluation must be done in one pass over the data. Secondly, XPath is a rich functional language, whose implementation is a nontrivial task. In addition to that, many XPath features such as descendant axis, predicate evaluation, and wildcards require elaborate algorithms in order to be processed efficiently. A generic XPath query can be represented as a sequence of location steps, where each step contains axis, node test, and zero or more predicates associated with it. A last location step is called an output expression; this is the expression that answers the query. For example, in a query //book[price < 30]/title we are querying for titles of all books priced under $30. The location steps here are //book[price < 30] (axis is //, node test is book predicate is [price < 30]) and title (axis is /, node test is title, no predicate). The output expression here is title. 2. RELATED WORK At the dawn of the XPath streaming evaluation, the researchers tried to solve the problem by building a finite automaton out of XPath query and running the SAX parser events through this automaton. Different forms of FA were tried – DFA [14] and NFA [15], built during preprocessing or lazily. The most recent of the FA algorithms is XSQ. It uses a sophisticated XPDT (eXtended PushDown Transducer) model for its hierarchical automaton, due to which it claims to provide the highest level of the XPath language support. On the other hand, TurboXPath is a tree-based XPath streaming algorithm. It first builds parse tree (PT) for the input query and then keeps matching the streamed document against the PT nodes. It uses smart matching arrays during the matching, avoiding exponential memory usage typical for FA algorithms. The name ‘TurboXPath’ is associated with several versions of the algorithm. It was first used in [10], then in [9], and finally in [4]. We use version [4] in our project since it is the most recent of all of the above. 3. PRELIMINARIES 3.1. General Implementation Information We implemented TurboXPath in Java using JDK 1.4.2. All our testing was done on Solaris 2.8. We used Apache Xerces 2.6.2 [6] for both XPath and XML parsers. Since XSQ implementation is also in Java, this allowed us to provide a more precise performance comparison between the implementations. The earlier benchmarks in [4] favor TurboXPath based on a comparison between its C++ and XSQ Java implementations. 3.2. Benchmark We chose XPathMark [7] as our benchmark for both functional and performance analysis. Note that the link [7] has changed after we started our project, so we use a slightly older version containing 23 queries. See Appendix 2 for the full list of queries. The benchmark comes with an XML document, containing the data to run the queries on and the file of answers. The document contains 60K of data (785 elements, depth = 11) modeling an Internet auction Web site. Our first intention was to use XMark [8] for our analysis. However, the investigation showed that XMark is primarily an XQuery benchmark and does not provide an adequate XPath coverage. On other hand, XPathMark exercises a broad variety of XPath axes, predicates, and functions. As we will see later in the document, XPathMark is a very predicate-intensive benchmark, whose 73% of the queries contain predicates. We do see it as a plus. In the real world, the queries will likely be complex, since they will likely be generated by application software and not typed 1by an end-user. Hence having higher level of complexity in the benchmark makes it closer to the real-life applications. 4. TURBOXPATH ANALYSIS 4.1. Algorithm corrections We started with the following pseudo-code from [4]: Function startElement(x) 1: maxIndex := nextIndex - 1 2: for i := 0 to maxIndex do 3: u := pointerArray[i] 4: if ((ntest(u) = label(x)) or (ntest(u) = *)) then 5: if ((axis(u) = descendant) or 6: (levelArray[i] = currentLevel)) then 7: if (not validationArray[i]) then 8: for c in children(u) do 9: if ((c = firstSibling(c)) and 10: (recursionArray[u] = 0)) then 11: pointerArray[i] := c 12: levelArray[i]:=currentLevel+1 13: else 14: pointerArray[nextIndex] := c 15: levelArray[nextIndex] :=currentLevel+1 16: validationArray[nextIndex]:=0 17: nextIndex := nextIndex + 1 18: recursionArray[u]:=recursionArray[u]+1 19: currentLevel := currentLevel + 1 Function endElement(x) 1: currentLevel := currentLevel - 1 2: i := nextIndex - 1 3: while (levelArray[i] > currentLevel) do 4: u := pointerArray[i] 5: if ((u = firstSibling(u) and 6: (recursionArray[parent(u)] = 0)) then 7: pointerArray[i] := parent(u)

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4 out of 11 pages.

Stanford CS 276 - Study Notes

Sign up for free to view:

Please select your school