CS276B Text Retrieval and Mining, Winter 2005, Lecture 12

Slide titles:
  What is XML?
  Basic Structure
  XML Example
  Elements
  XML vs HTML
  XML: Design Goals
  Why Use XML?
  Applications of XML
  XML Schemas
  XML Tutorial
  XML Indexing and Search
  Native XML Database
  Data vs. Text-centric XML
  IR XML Challenge 1: Term Statistics
  IR XML Challenge 2: Fragments
  IR XML Challenge 3: Schemas
  IR XML Challenge 4: UI
  IR XML Challenge 5: using a DB
  Querying XML
  XQuery
  FLWR
  Queries Supported by XQuery
  How XQuery makes ranking difficult
  XQuery: Order By Clause
  XQuery Order By Clause
  XIRQL
  Atomic Units
  XIRQL Indexing
  Structured Document Retrieval Principle
  Augmentation weights
  Datatypes
  XIRQL: Summary
  Data structures for XML retrieval
  Parent/child links
  General positional indexes
  Positional containment
  Summary of data structures
  Resources

What is XML?
  eXtensible Markup Language
  A framework for defining markup languages
  No fixed collection of markup tags
  Each XML language is targeted at a particular application
  All XML languages share features
  Enables building of generic tools

Basic Structure
  An XML document is an ordered, labeled tree
  Character data leaf nodes contain the actual data (text strings)
  Element nodes are each labeled with
    a name (often called the element type), and
    a set of attributes, each consisting of a name and a value
  Element nodes can have child nodes

XML Example
  <chapter id="cmds">
    <chaptitle>FileCab</chaptitle>
    <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
  </chapter>

Elements
  Elements are denoted by markup tags
    <foo attr1="value" … > thetext </foo>
  Element start tag: foo
  Attribute: attr1
  The character data: thetext
  Matching element end tag: </foo>

XML vs HTML
  HTML is a markup language for a specific purpose (display in browsers)
  XML is a framework for defining markup languages
  HTML can be formalized as an XML language (XHTML)
  XML defines logical structure only
  HTML: same intention, but has evolved into a presentation language

XML: Design Goals
  Separate syntax from semantics to provide a common framework for structuring information
  Allow tailor-made markup for any imaginable application domain
  Support internationalization (Unicode) and platform independence
  Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?
  Represent semi-structured data (data that are structured, but don’t fit the relational model)
  XML is more flexible than DBs
  XML is more structured than simple IR
  You get a massive infrastructure for free

Applications of XML
  XHTML
  CML – chemical markup language
  WML – wireless markup language
  ThML – theological markup language
    <h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
    <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
      <note place="foot" id="One.2.p0.4">
        <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6">
          <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1.
        </added></p>
      </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
      <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
        <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p>
      </note></added>
    </p>

XML Schemas
  Schema = syntax definition of XML language
  Schema language = formal language for expressing XML schemas
  Examples
    Document Type Definition
    XML Schema (W3C)
  Relevance for XML IR
    Our job is much easier if we have a (one) schema

XML Tutorial
  http://www.brics.dk/~amoeller/XML/index.html
  (Anders Møller and Michael Schwartzbach)
  Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database
  Uses XML document as logical unit
  Should support
    Elements
    Attributes
    PCDATA (parsed character data)
    Document order
  Contrast with
    DB modified for XML
    Generic IR system modified for XML

XML Indexing and Search
  Most native XML databases have taken a DB approach
    Exact match
    Evaluate path expressions
    No IR type relevance ranking
  Only a few focus on relevance ranking

Data vs. Text-centric XML
  Data-centric XML: used for messaging between enterprise applications
    Mainly a recasting of relational data
  Content-centric XML: used for annotating content
    Rich in text
    Demands good integration of text retrieval functionality
    E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR XML Challenge 1: Term Statistics
  There is no document unit in XML
  How do we compute tf and idf?
  Global tf/idf over all text context is useless
  Indexing granularity

IR XML Challenge 2: Fragments
  IR systems don’t store content (only index)
  Need to go to document for retrieving/displaying fragment
    E.g., give me the Abstracts of Papers on existentialism
    Where do you retrieve the Abstract from?
  Easier in DB framework

IR XML Challenge 3: Schemas
  Ideally:
    There is one schema
    User understands schema
  In practice: rare
    Many schemas
    Schemas not known in advance
    Schemas change
    Users don’t understand schemas
  Need to identify similar elements in different schemas
    Example: employee

IR XML Challenge 4: UI
  Help user find relevant nodes in schema
    Author, editor, contributor, “from:”/sender
  What is the query language you expose to the user?
    Specific XML query language? No.
    Forms? Parametric search?
    A textbox?
  In general: design layer between XML and user

IR XML Challenge 5: using a DB
  Why you don’t want to use a DB
    Spelling correction
    Mid-word wildcards
    Contains vs “is about”
    DB has no notion of ordering
    Relevance ranking

Querying XML
  Today:
    XQuery
    XIRQL
  Lecture 15
    Vector space approaches

XQuery
  SQL for XML
  Usage scenarios
    Human-readable documents
    Data-oriented documents
    Mixed documents (e.g., patient records)
  Relies on
    XPath
    XML Schema datatypes
  Turing complete
  XQuery is still a working draft.

XQuery
  The principal forms of XQuery expressions are:
    path expressions
    element constructors
    FLWR ("flower") expressions
    list expressions
    conditional expressions
    quantified expressions
    datatype expressions
  Evaluated with respect to a context

FLWR
  FOR $p IN document("bib.xml")//publisher
  LET $b := document("bib.xml")//book[publisher
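
Sketch: element-level term counts
  The "ordered, labeled tree" model from the Basic Structure slide and the indexing-granularity question from Challenge 1 can be illustrated with a small Python sketch. It is not part of the lecture. It parses the chapter example from the XML Example slide with the standard library's xml.etree.ElementTree and counts terms per element, treating each element's whole subtree text as one indexing unit. That choice of unit, the helper name element_term_counts, and the simple [a-z0-9]+ tokenizer are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# The chapter example from the "XML Example" slide.
DOC = (
    '<chapter id="cmds"> <chaptitle>FileCab</chaptitle> '
    '<para>This chapter describes the commands that manage the '
    '<tm>FileCab</tm>inet application.</para> </chapter>'
)


def element_term_counts(root):
    """Return a list of (path, Counter) pairs, one per element.

    Hypothetical helper: each element is its own indexing unit, and its
    terms are all the text found in its subtree (an assumed granularity).
    """
    units = []

    def visit(elem, parent_path):
        path = parent_path + "/" + elem.tag
        # itertext() yields the subtree's character data in document order.
        text = "".join(elem.itertext())
        terms = re.findall(r"[a-z0-9]+", text.lower())
        units.append((path, Counter(terms)))
        for child in elem:  # children are iterated in document order
            visit(child, path)

    visit(root, "")
    return units


root = ET.fromstring(DOC)
units = element_term_counts(root)
for path, tf in units:
    print(path, dict(tf))
```

  The same word gets a different term frequency depending on which element is taken as the unit, which is why a single global tf/idf is not enough. The inline <tm> element also shows a tokenization issue: at the para level its text joins with the following characters, while at the tm level it is a term of its own.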