Unformatted text preview:

1CS276BText Retrieval and MiningWinter 2005Lecture 15Plan for todayVector space approaches to XML retrievalEvaluating text-centric retrievalText-centric XML retrievalDocuments marked up as XMLE.g., assembly manuals, journal issues …Queries are user information needs E.g., give me the Section (element) of the document that tells me how to change a brake lightDifferent from well-structured XML queries where you tightly specify what you’re looking for.Vector spaces and XMLVector spaces – tried+tested framework for keyword retrievalOther “bag of words” applications in text: classification, clustering …For text-centric XML retrieval, can we make use of vector space ideas?Challenge: capture the structure of an XML document in the vector space.Vector spaces and XMLFor instance, distinguish between the following two casesBookTitleAuthorBill GatesMicrosoftBookTitleAuthorBill WulfThe PearlyGatesContent-rich XML: representationBookTitleAuthorBillMicrosoftBookTitleAuthorWulfPearlyGatesGatesTheBillLexicon terms.2Encoding the Gates differentlyWhat are the axes of the vector space?In text retrieval, there would be a single axis for GatesHere we must separate out the two occurrences, under Author and TitleThus, axes must represent not only terms, but something about their position in an XML treeQueriesBefore addressing this, let us consider the kinds of queries we want to handleBookTitleMicrosoftBookTitleAuthorGatesBillQuery typesThe preceding examples can be viewed as subtrees of the documentBut what about?(Gatessomewhere underneath Book)This is harder and we will return to it later.BookGatesSubtrees and structureConsider all subtrees of the document that include at least one lexicon term:BookTitleAuthorBillMicrosoft GatesBillMicrosoftGatesTitleMicrosoftAuthorBillAuthorGatesBookTitleMicrosoftBillBookAuthorGatese.g.…Structural termsCall each of the resulting (8+, in the previous slide) subtrees a structural termNote that structural terms might occur multiple times in a documentCreate one axis in the vector space for each distinct structural termWeights based on frequencies for number of occurrences (just as we had tf)All the usual issues with terms (stemming? Case folding?) remainExample of tf weightingHere the structural terms containing toor bewould have more weight than those that don’tPlayActTo be or not to bePlayActbePlayActorPlayActnotPlayActtoExercise: How many axes are there in this example?3Down-weightingFor the doc on the left: in a structural term rooted at the node Play, shouldn’t Hamlethave a higher tf weight than Yorick?Idea: multiply tfcontribution of a term to a node k levels up by γk, for some γ < 1.PlayActAlas poor YorickSceneTitleHamletDown-weighting example, γ=0.8For the doc on the previous slide, the tf ofHamletis multiplied by 0.8Yor ickis multiplied by 0.64in any structural term rooted at Play.The number of structural termsCan be huge!Impractical to build a vector space index with so many dimensionsWill examine pragmatic solutions to this shortly; for now, continue to believe …Alright, how huge, really?Structural terms: docs+queriesThe notion of structural terms is independent of any schema/DTD for the XML documentsWell-suited to a heterogeneous collection of XML documentsEach document becomes a vector in the space of structural termsA query tree can likewise be factored into structural termsAnd represented as a vectorAllows weighting portions of the queryExample queryBookTitleAuthorGates Bill0.6 0.4TitleAuthorGates Bill0.6 0.4BookTitleGates0.6BookAuthorBill0.4…Weight propagationThe assignment of the weights 0.6 and 0.4 in the previous example to subtrees was simplisticCan be more sophisticatedThink of it as generated by an application, not necessarily an end-userQueries, documents become normalized vectorsRetrieval score computation “just” a matter of cosine similarity computation4Restrict structural terms?Depending on the application, we may restrict the structural termsE.g., may never want to return a Title node, only Book or Play nodesSo don’t enumerate/index/retrieve/score structural terms rooted at some nodesThe catch remainsThis is all very promising, but …How big is this vector space?Can be exponentially large in the size of the documentCannot hope to build such an indexAnd in any case, still fails to answer queries likeBookGates(somewhere underneath)Two solutionsQuery-time materialization of axesRestrict the kinds of subtrees to a manageable setQuery-time materializationInstead of enumerating all structural terms of all docs (and the query), enumerate only for the queryThe latter is hopefully a small setNow, we’re reduced to checking which structural term(s) from the query match a subtree of any documentThis is tree pattern matching: given a text tree and a pattern tree, find matchesExcept we have many text treesOur trees are labeled and weightedExampleHere we seek a doc with Hamletin the titleOn finding the match we compute the cosine similarity scoreAfter all matches are found, rank by sortingPlayActAlas poor YorickSceneText =Query =HamletTitleHamletTitle(Still infeasible)A doc with Yoricksomewhere in it:Query =Will get to it …YorickTitle5Restricting the subtreesEnumerating all structural terms (subtrees) is prohibitive, for indexingMost subtrees may never be used in processing any queryCan we get away with indexing a restricted class of subtreesIdeally – focus on subtrees likely to arise in queriesJuruXML (IBM Haifa)Only paths including a lexicon termIn this example there are only 14 (why?) such pathsThus we have 14 structural terms in the indexPlayActTo be or not to beSceneTitleHamletWhy is this far more manageable?How big can the index be as a function of the text?VariationsCould have used other subtrees – e.g., all subtrees with two siblings under a nodeWhich subtrees get used: depends on the likely queries in the applicationCould be specified at index time – area with little research so farBookTitleAuthorBillMicrosoft GatesBookTitleAuthorBillMicrosoft2 termsGatesVariationsWhy would this be any different from just paths?Because we preserve more of the structure that a query may seekBookTitleAuthorBillMicrosoftTitleAuthorGates


View Full Document

Stanford CS 276B - Text Retrieval and Mining

Download Text Retrieval and Mining
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Text Retrieval and Mining and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Text Retrieval and Mining 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?