1CS276BText Retrieval and MiningWinter 2005Lecture 15Plan for todayVector space approaches to XML retrievalEvaluating text-centric retrievalText-centric XML retrievalDocuments marked up as XMLE.g., assembly manuals, journal issues …Queries are user information needs E.g., give me the Section (element) of the document that tells me how to change a brake lightDifferent from well-structured XML queries where you tightly specify what you’re looking for.Vector spaces and XMLVector spaces – tried+tested framework for keyword retrievalOther “bag of words” applications in text: classification, clustering …For text-centric XML retrieval, can we make use of vector space ideas?Challenge: capture the structure of an XML document in the vector space.Vector spaces and XMLFor instance, distinguish between the following two casesBookTitleAuthorBill GatesMicrosoftBookTitleAuthorBill WulfThe PearlyGatesContent-rich XML: representationBookTitleAuthorBillMicrosoftBookTitleAuthorWulfPearlyGatesGatesTheBillLexicon terms.2Encoding the Gates differentlyWhat are the axes of the vector space?In text retrieval, there would be a single axis for GatesHere we must separate out the two occurrences, under Author and TitleThus, axes must represent not only terms, but something about their position in an XML treeQueriesBefore addressing this, let us consider the kinds of queries we want to handleBookTitleMicrosoftBookTitleAuthorGatesBillQuery typesThe preceding examples can be viewed as subtrees of the documentBut what about?(Gatessomewhere underneath Book)This is harder and we will return to it later.BookGatesSubtrees and structureConsider all subtrees of the document that include at least one lexicon term:BookTitleAuthorBillMicrosoft GatesBillMicrosoftGatesTitleMicrosoftAuthorBillAuthorGatesBookTitleMicrosoftBillBookAuthorGatese.g.…Structural termsCall each of the resulting (8+, in the previous slide) subtrees a structural termNote that structural terms might occur multiple times in a documentCreate one axis in the vector space for each distinct structural termWeights based on frequencies for number of occurrences (just as we had tf)All the usual issues with terms (stemming? Case folding?) remainExample of tf weightingHere the structural terms containing toor bewould have more weight than those that don’tPlayActTo be or not to bePlayActbePlayActorPlayActnotPlayActtoExercise: How many axes are there in this example?3Down-weightingFor the doc on the left: in a structural term rooted at the node Play, shouldn’t Hamlethave a higher tf weight than Yorick?Idea: multiply tfcontribution of a term to a node k levels up by γk, for some γ < 1.PlayActAlas poor YorickSceneTitleHamletDown-weighting example, γ=0.8For the doc on the previous slide, the tf ofHamletis multiplied by 0.8Yor ickis multiplied by 0.64in any structural term rooted at Play.The number of structural termsCan be huge!Impractical to build a vector space index with so many dimensionsWill examine pragmatic solutions to this shortly; for now, continue to believe …Alright, how huge, really?Structural terms: docs+queriesThe notion of structural terms is independent of any schema/DTD for the XML documentsWell-suited to a heterogeneous collection of XML documentsEach document becomes a vector in the space of structural termsA query tree can likewise be factored into structural termsAnd represented as a vectorAllows weighting portions of the queryExample queryBookTitleAuthorGates Bill0.6 0.4TitleAuthorGates Bill0.6 0.4BookTitleGates0.6BookAuthorBill0.4…Weight propagationThe assignment of the weights 0.6 and 0.4 in the previous example to subtrees was simplisticCan be more sophisticatedThink of it as generated by an application, not necessarily an end-userQueries, documents become normalized vectorsRetrieval score computation “just” a matter of cosine similarity computation4Restrict structural terms?Depending on the application, we may restrict the structural termsE.g., may never want to return a Title node, only Book or Play nodesSo don’t enumerate/index/retrieve/score structural terms rooted at some nodesThe catch remainsThis is all very promising, but …How big is this vector space?Can be exponentially large in the size of the documentCannot hope to build such an indexAnd in any case, still fails to answer queries likeBookGates(somewhere underneath)Two solutionsQuery-time materialization of axesRestrict the kinds of subtrees to a manageable setQuery-time materializationInstead of enumerating all structural terms of all docs (and the query), enumerate only for the queryThe latter is hopefully a small setNow, we’re reduced to checking which structural term(s) from the query match a subtree of any documentThis is tree pattern matching: given a text tree and a pattern tree, find matchesExcept we have many text treesOur trees are labeled and weightedExampleHere we seek a doc with Hamletin the titleOn finding the match we compute the cosine similarity scoreAfter all matches are found, rank by sortingPlayActAlas poor YorickSceneText =Query =HamletTitleHamletTitle(Still infeasible)A doc with Yoricksomewhere in it:Query =Will get to it …YorickTitle5Restricting the subtreesEnumerating all structural terms (subtrees) is prohibitive, for indexingMost subtrees may never be used in processing any queryCan we get away with indexing a restricted class of subtreesIdeally – focus on subtrees likely to arise in queriesJuruXML (IBM Haifa)Only paths including a lexicon termIn this example there are only 14 (why?) such pathsThus we have 14 structural terms in the indexPlayActTo be or not to beSceneTitleHamletWhy is this far more manageable?How big can the index be as a function of the text?VariationsCould have used other subtrees – e.g., all subtrees with two siblings under a nodeWhich subtrees get used: depends on the likely queries in the applicationCould be specified at index time – area with little research so farBookTitleAuthorBillMicrosoft GatesBookTitleAuthorBillMicrosoft2 termsGatesVariationsWhy would this be any different from just paths?Because we preserve more of the structure that a query may seekBookTitleAuthorBillMicrosoftTitleAuthorGates
View Full Document