Stanford CS 276B - What is XML


CS276B Text Retrieval and Mining, Winter 2005
Lecture 12

Outline
  What is XML?; Basic Structure; XML Example; Elements; XML vs HTML; XML: Design Goals; Why Use XML?; Applications of XML; XML Schemas; XML Tutorial; XML Indexing and Search; Native XML Database; Data vs. Text-centric XML; IR XML Challenge 1: Term Statistics; IR XML Challenge 2: Fragments; IR XML Challenges 3: Schemas; IR XML Challenges 4: UI; IR XML Challenges 5: using a DB; Querying XML; XQuery; FLWR; Queries Supported by XQuery; How XQuery makes ranking difficult; XQuery: Order By Clause; XIRQL; Atomic Units; XIRQL Indexing; Structured Document Retrieval Principle; Augmentation weights; Datatypes; XIRQL: Summary; Data structures for XML retrieval; Parent/child links; General positional indexes; Positional containment; Summary of data structures; Resources

What is XML?
  eXtensible Markup Language
  A framework for defining markup languages
  No fixed collection of markup tags
  Each XML language targeted for application
  All XML languages share features
  Enables building of generic tools

Basic Structure
  An XML document is an ordered, labeled tree
  Character data leaf nodes contain the actual data (text strings)
  Element nodes are each labeled with
    a name (often called the element type), and
    a set of attributes, each consisting of a name and a value
  Element nodes can have child nodes

XML Example
  <chapter id="cmds">
    <chaptitle>FileCab</chaptitle>
    <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
  </chapter>

Elements
  Elements are denoted by markup tags
    <foo attr1="value" … > thetext </foo>
  Element start tag: foo
  Attribute: attr1
  The character data: thetext
  Matching element end tag: </foo>

XML vs HTML
  HTML is a markup language for a specific purpose (display in browsers)
  XML is a framework for defining markup languages
  HTML can be formalized as an XML language (XHTML)
  XML defines logical structure only
  HTML: same intention, but has evolved into a presentation language

XML: Design Goals
  Separate syntax from semantics to provide a common framework for structuring information
  Allow tailor-made markup for any imaginable application domain
  Support internationalization (Unicode) and platform independence
  Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?
  Represent semi-structured data (data that are structured, but don’t fit relational model)
  XML is more flexible than DBs
  XML is more structured than simple IR
  You get a massive infrastructure for free

Applications of XML
  XHTML
  CML – chemical markup language
  WML – wireless markup language
  ThML – theological markup language
    <h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
    <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
      <note place="foot" id="One.2.p0.4">
        <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p>
      </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
      <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
        <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p>
      </note></added>
    </p>

XML Schemas
  Schema = syntax definition of XML language
  Schema language = formal language for expressing XML schemas
  Examples
    Document Type Definition
    XML Schema (W3C)
  Relevance for XML IR
    Our job is much easier if we have a (one) schema

XML Tutorial
  http://www.brics.dk/~amoeller/XML/index.html
  (Anders Møller and Michael Schwartzbach)
  Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database
  Uses XML document as logical unit
  Should support
    Elements
    Attributes
    PCDATA (parsed character data)
    Document order
  Contrast with
    DB modified for XML
    Generic IR system modified for XML

XML Indexing and Search
  Most native XML databases have taken a DB approach
    Exact match
    Evaluate path expressions
    No IR type relevance ranking
  Only a few that focus on relevance ranking

Data vs. Text-centric XML
  Data-centric XML: used for messaging between enterprise applications
    Mainly a recasting of relational data
  Content-centric XML: used for annotating content
    Rich in text
    Demands good integration of text retrieval functionality
    E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR XML Challenge 1: Term Statistics
  There is no document unit in XML
  How do we compute tf and idf?
  Global tf/idf over all text context is useless
  Indexing granularity

IR XML Challenge 2: Fragments
  IR systems don’t store content (only index)
  Need to go to document for retrieving/displaying fragment
    E.g., give me the Abstracts of Papers on existentialism
    Where do you retrieve the Abstract from?
  Easier in DB framework

IR XML Challenges 3: Schemas
  Ideally:
    There is one schema
    User understands schema
  In practice: rare
    Many schemas
    Schemas not known in advance
    Schemas change
    Users don’t understand schemas
  Need to identify similar elements in different schemas
    Example: employee

IR XML Challenges 4: UI
  Help user find relevant nodes in schema
    Author, editor, contributor, “from:”/sender
  What is the query language you expose to the user?
    Specific XML query language? No.
    Forms? Parametric search?
    A textbox?
  In general: design layer between XML and user

IR XML Challenges 5: using a DB
  Why you don’t want to use a DB
    Spelling correction
    Mid-word wildcards
    Contains vs “is about”
    DB has no notion of ordering
    Relevance ranking

Querying XML
  Today:
    XQuery
    XIRQL
  Lecture 15
    Vector space approaches

XQuery
  SQL for XML
  Usage scenarios
    Human-readable documents
    Data-oriented documents
    Mixed documents (e.g., patient records)
  Relies on
    XPath
    XML Schema datatypes
  Turing complete
  XQuery is still a working draft.

XQuery
  The principal forms of XQuery expressions are:
    path expressions
    element constructors
    FLWR ("flower") expressions
    list expressions
    conditional expressions
    quantified expressions
    datatype expressions
  Evaluated with respect to a context

FLWR
  FOR $p IN document("bib.xml")//publisher
  LET $b := document("bib.xml")//book[publisher
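As an aside on the Basic Structure and XML Example slides above: the "ordered, labeled tree" view can be made concrete with a short sketch. This is not part of the lecture. It assumes Python's standard `xml.etree.ElementTree` module, and the helper name `walk` is made up for illustration. It parses the chapter example and lists element nodes (name plus attributes) and character-data leaves in document order.

```python
import xml.etree.ElementTree as ET

# The chapter example from the "XML Example" slide, written without
# inter-element whitespace so that only real character data shows up.
DOC = (
    '<chapter id="cmds">'
    '<chaptitle>FileCab</chaptitle>'
    '<para>This chapter describes the commands that manage the '
    '<tm>FileCab</tm>inet application.</para>'
    '</chapter>'
)


def walk(elem, depth=0):
    """Yield (depth, kind, value) triples in document order.

    `walk` is a hypothetical helper for this sketch, not a library function.
    """
    # Element node: labeled with a name and a set of attributes.
    yield depth, "element", (elem.tag, dict(elem.attrib))
    # Character data directly after the start tag is a leaf child.
    if elem.text and elem.text.strip():
        yield depth + 1, "text", elem.text
    for child in elem:
        yield from walk(child, depth + 1)
        # ElementTree stores text that follows a child in child.tail;
        # in the tree model it is a leaf of the parent (mixed content).
        if child.tail and child.tail.strip():
            yield depth + 1, "text", child.tail


root = ET.fromstring(DOC)
nodes = list(walk(root))
for depth, kind, value in nodes:
    print("  " * depth + f"{kind}: {value!r}")
```

Note how the `para` element has three ordered children: a text leaf, the `tm` element, and a second text leaf ("inet application."). That mixed content is typical of text-centric XML and is one reason document order matters for indexing.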

