CS276B
Text Retrieval and Mining
Winter 2005
Lecture 12

What is XML?
- eXtensible Markup Language
- A framework for defining markup languages
- No fixed collection of markup tags
- Each XML language is targeted at a particular application
- All XML languages share features
- Enables building of generic tools

Basic Structure
- An XML document is an ordered, labeled tree
- Character data leaf nodes contain the actual data (text strings)
- Element nodes are each labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; they can have child nodes

XML Example
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>

Elements
- Elements are denoted by markup tags
- <foo attr1="value" … > thetext </foo>
- Element start tag: foo
- Attribute: attr1
- The character data: thetext
- Matching element end tag: </foo>

XML vs HTML
- HTML is a markup language for a specific purpose (display in browsers)
- XML is a framework for defining markup languages
- HTML can be formalized as an XML language (XHTML)
- XML defines logical structure only
- HTML: same intention, but has evolved into a presentation language

XML: Design Goals
- Separate syntax from semantics to provide a common framework for structuring information
- Allow tailor-made markup for any imaginable application domain
- Support internationalization (Unicode) and platform independence
- Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?
- Represent semi-structured data (data that are structured, but don’t fit relational model)
- XML is more flexible than DBs
- XML is more structured than simple IR
- You get a massive infrastructure for free

Applications of XML
- XHTML
- CML – chemical markup language
- WML – wireless markup language
- ThML – theological markup language

<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
<p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge <note
place="foot" id="One.2.p0.4"> <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p> </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9"> <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p> </note></added> </p>

XML Schemas
- Schema = syntax definition of XML language
- Schema language = formal language for expressing XML schemas
- Examples
  - Document Type Definition
  - XML Schema (W3C)
- Relevance for XML IR
  - Our job is much easier if we have a (one) schema

XML Tutorial
- http://www.brics.dk/~amoeller/XML/index.html
- (Anders Møller and Michael Schwartzbach)
- Previous (and some following) slides are based on their tutorial

XML Indexing and Search
- Native XML Database
  - Uses XML document as logical unit
  - Should support
    - Elements
    - Attributes
    - PCDATA (parsed character data)
    - Document order
- Contrast with
  - DB modified for XML
  - Generic IR system modified for XML

XML Indexing and Search
- Most native XML databases have taken a DB approach
  - Exact match
  - Evaluate path expressions
  - No IR type relevance ranking
- Only a few that focus on relevance ranking

Data vs.
Text-centric XML
- Data-centric XML: used for messaging between enterprise applications
  - Mainly a recasting of relational data
- Content-centric XML: used for annotating content
  - Rich in text
  - Demands good integration of text retrieval functionality
  - E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR XML Challenge 1: Term Statistics
- There is no document unit in XML
- How do we compute tf and idf?
- Global tf/idf over all text context is useless
- Indexing granularity

IR XML Challenge 2: Fragments
- IR systems don’t store content (only index)
- Need to go to document for retrieving/displaying fragment
  - E.g., give me the Abstracts of Papers on existentialism
  - Where do you retrieve the Abstract from?
- Easier in DB framework

IR XML Challenge 3: Schemas
- Ideally:
  - There is one schema
  - User understands schema
- In practice: rare
  - Many schemas
  - Schemas not known in advance
  - Schemas change
  - Users don’t understand schemas
- Need to identify similar elements in different schemas
  - Example: employee

IR XML Challenge 4: UI
- Help user find relevant nodes in schema
  - Author, editor, contributor, “from:”/sender
- What is the query language you expose to the user?
  - Specific XML query language? No.
  - Forms?
  - Parametric search?
  - A textbox?
- In general: design layer between XML and user

IR XML Challenge 5: using a DB
- Why you don’t want to use a DB
  - Spelling correction
  - Mid-word wildcards
  - Contains vs “is about”
  - DB has no notion of ordering
  - Relevance ranking

Querying XML
- Today:
  - XQuery
  - XIRQL
- Lecture 15
  - Vector space approaches

XQuery
- SQL for XML
- Usage scenarios
  - Human-readable documents
  - Data-oriented documents
  - Mixed documents (e.g., patient records)
- Relies on
  - XPath
  - XML Schema datatypes
- Turing complete
- XQuery is still a working draft.

XQuery
- The principal forms of XQuery expressions are:
  - path expressions
  - element constructors
  - FLWR ("flower") expressions
  - list expressions
  - conditional expressions
  - quantified expressions
  - datatype expressions
- Evaluated with respect to a context

FLWR
FOR $p IN document("bib.xml")//publisher
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p

- FOR generates an ordered list of bindings of publisher names to $p
- LET associates to each binding a further binding of the list of book elements with that publisher to $b
- At this stage, we have an ordered list of tuples of bindings: ($p,$b)
- WHERE filters that list to retain only the desired tuples
- RETURN constructs for each tuple a resulting value

Queries Supported by XQuery
- Location/position (“chapter no.3”)
- Simple attribute/value
  - /play/title contains “hamlet”
- Path queries
  - title contains “hamlet”
  - /play//title contains “hamlet”
- Complex graphs
  - Employees with two managers
- Subsumes: hyperlinks
- What about relevance ranking?

How XQuery makes ranking difficult
- All documents in set A must be
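
As a supplement to the FLWR slide, the four evaluation steps (FOR, LET, WHERE, RETURN) can be sketched in plain Python. This is only an illustration of the binding-and-filtering idea, not an XQuery engine. The sample document, the function name flwr_publishers, the min_books parameter, and the choice to drop duplicate publisher names are all assumptions made for the example; the threshold is lowered from 100 so that the tiny sample produces a result.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; element names follow the slide's query.
BIB_XML = """
<bib>
  <book><title>A</title><publisher>Acme</publisher></book>
  <book><title>B</title><publisher>Acme</publisher></book>
  <book><title>C</title><publisher>Zenith</publisher></book>
</bib>
"""


def flwr_publishers(xml_text, min_books):
    """Return publisher names that have more than min_books books."""
    root = ET.fromstring(xml_text)

    # FOR: build an ordered list of bindings of publisher names to p.
    # Assumption: duplicate names are dropped, keeping first-occurrence order.
    publishers = []
    for node in root.iter("publisher"):
        name = (node.text or "").strip()
        if name not in publishers:
            publishers.append(name)

    # LET: for each binding of p, bind b to the list of matching book elements.
    tuples = []
    for p in publishers:
        b = [book for book in root.iter("book")
             if (book.findtext("publisher") or "").strip() == p]
        tuples.append((p, b))

    # WHERE: keep only the desired tuples. RETURN: one value per kept tuple.
    return [p for p, b in tuples if len(b) > min_books]


result = flwr_publishers(BIB_XML, min_books=1)
print(result)
```

The intermediate list named tuples corresponds to the ordered list of ($p,$b) bindings mentioned on the slide; the final list comprehension plays the role of WHERE and RETURN together.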