CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 3
8/30/2010
CSCI 5417 - IR

Today 8/30
  Review
    Conjunctive queries (intersect)
    Dictionary contents
  Phrasal queries
  Tolerant query handling
    Wildcards
    Spelling correction

Index: Dictionary and Postings
  Term/document entries with within-document frequency:

  Term       Doc #  Freq
  ambitious  2      1
  be         2      1
  brutus     1      1
  brutus     2      1
  capitol    1      1
  caesar     1      1
  caesar     2      2
  did        1      1
  enact      1      1
  hath       2      1
  I          1      2
  i'         1      1
  it         2      1
  julius     1      1
  killed     1      2
  let        2      1
  me         1      1
  noble      2      1
  so         2      1
  the        1      1
  the        2      1
  told       2      1
  you        2      1
  was        1      1
  was        2      1
  with       2      1

  Dictionary, with each term's postings:

  Term       N docs  Coll freq  Postings (Doc #: Freq)
  ambitious  1       1          2:1
  be         1       1          2:1
  brutus     2       2          1:1, 2:1
  capitol    1       1          1:1
  caesar     2       3          1:1, 2:2
  did        1       1          1:1
  enact      1       1          1:1
  hath       1       1          2:1
  I          1       2          1:2
  i'         1       1          1:1
  it         1       1          2:1
  julius     1       1          1:1
  killed     1       2          1:2
  let        1       1          2:1
  me         1       1          1:1
  noble      1       1          2:1
  so         1       1          2:1
  the        2       2          1:1, 2:1
  told       1       1          2:1
  you        1       1          2:1
  was        2       2          1:1, 2:1
  with       1       1          2:1

Boolean AND: Intersection (1)
  [Figure]

Boolean AND: Intersection (2)
  [Figure]

Review: Dictionary
  What goes into creating the terms that make it into the dictionary?
    Tokenization
    Case folding
    Stemming
    Stop-listing
    Normalization
    Dealing with numbers (and number-like entities)
    Complex morphology

Dictionary
  The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list.
  In what kind of data structure?

A Naïve Dictionary
  An array of structs?

  Type  char[20]  int        postings *
  Size  20 bytes  4/8 bytes  4/8 bytes

  How do we quickly look up elements at query time?

Dictionary Data Structures
  Two main choices:
    Hash tables
    Trees
  Some IR systems use hashes, some trees. The choice depends on the application details.

Hashes
  Each vocabulary term is hashed to an integer
    I assume you've seen hashtables before
  Pros:
    Lookup is faster than for a tree: O(1)
  Cons:
    No easy way to find minor variants: judgment/judgement
    No prefix search [tolerant retrieval]
    If the vocabulary keeps growing, need to occasionally rehash everything

Binary Tree Approach (Sec. 3.1)
  [Figure: binary tree over the vocabulary]
  Root
    a-m
      a-hu
      hy-m
    n-z
      n-sh
      si-z
  Terms shown at the leaves: aardvark, huygens, sickle, zygot

Tree: B-tree
  Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
  [Figure: a B-tree node with children a-hu, hy-m, n-z]

Trees
  Simplest approach: binary trees
  More typical: B-trees
  Trees require a standard ordering of characters and hence strings … but we have that
  Pros:
    Facilitates prefix processing (terms starting with hyp)
      Google's "search as you type"
  Cons:
    Slower: O(log M), where M is the vocabulary size [and this requires a balanced tree]
    Rebalancing binary trees is expensive
      But B-trees mitigate the rebalancing problem

Back to Query Processing
  Users are so demanding...
  In addition to phrasal queries, they like to
    Use wild-card queries
    Misspell stuf
  So we better see what we can do about those things

Phrasal queries
  Want to handle queries such as "Colorado Buffaloes" as a phrase
  This concept is popular with users; about 10% of ad hoc web queries are phrasal queries
  Postings that consist of document lists alone are not sufficient to handle phrasal queries
  Two general approaches
    Word N-gram indexing
    Positional indexing

Positional Indexing
  Change the content of the postings
  Store, for each term, entries of the form:
    <number of docs containing term;
     doc1: position1, position2 … ;
     doc2: position1, position2 … ;
     etc.>

Positional index example
    <be: 993427;
     1: 7, 18, 33, 72, 86, 231;
     2: 3, 149;
     4: 17, 191, 291, 430, 434;
     5: 363, 367, …>
  Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Processing a phrase query
  Extract postings for each distinct term: to, be, or, not.
  Merge their doc:position lists to enumerate all positions with "to be or not to be".
    to: 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
    be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
  Same general method for proximity searches ("near" operator).

Rules of thumb
  Positional index size is 35–50% of the volume of the original text
  Caveat: all of this holds for "English-like" languages

Wild Card Queries
  Two flavors
    Word-based
      Caribb*
    Phrasal
      "Pirates * Caribbean"
  General approach
    Spawn a new set of queries from the original query
      Basically a dictionary operation
    Run each of those queries in a not totally stupid way

Simple Single Wild-card Queries: *
  Single instance of a *
    * means a string of length 0 or more
    This is not Kleene *.
  mon*: find all docs containing any word beginning with "mon".
    Using trees to implement the dictionary gives you prefixes
  *mon: find words ending in "mon"
    Maintain a backwards index

Query processing
  At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
  We still have to look up the postings for each enumerated term.
  For example, consider the query
    mon* AND octob*
  This results in the execution of many Boolean AND queries.

Arbitrary Wildcards
  How can we handle *'s in the middle of a query term?
  The solution: transform every possible wild-card query so that the *'s occur at the end
  This motivates the Permuterm Index
    The dictionary/tree scheme remains the same, but we populate the dictionary with extra (special) terms

Permuterm Index
  For the real term hello create entries under:
    hello$, ello$h, llo$he, lo$hel, o$hell
  where $ is a special symbol.
  Example
    Query = hel*o
    Add the $: hel*o$
    Rotate * to the back
    Lookup o$hel*
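The permuterm steps above (add the $, rotate the * to the back, do a prefix lookup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sorted list searched with bisect stands in for the dictionary tree, the function names and the toy vocabulary are made up for this example, and only queries with exactly one * are handled. The sketch stores every rotation of term$, so hello also gets the key $hello, which is what a pure prefix query such as h* ends up looking for.

```python
from bisect import bisect_left


def rotations(term):
    """Return every rotation of term + '$' (the permuterm keys for term)."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(vocabulary):
    """Sorted (rotation, term) pairs; a stand-in for the dictionary tree."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))


def rotate_query(query):
    """Append '$', rotate so the single '*' is last, and drop the '*'."""
    marked = query + "$"
    star = marked.index("*")
    return marked[star + 1:] + marked[:star]


def wildcard_lookup(index, query):
    """Return the sorted terms matching a query with exactly one '*'."""
    prefix = rotate_query(query)
    # (prefix,) sorts before any (prefix + ..., term), so this finds the
    # first rotation that could start with the prefix.
    start = bisect_left(index, (prefix,))
    matches = set()
    for rot, term in index[start:]:
        if not rot.startswith(prefix):
            break
        matches.add(term)
    return sorted(matches)


# Toy vocabulary (hypothetical, not from the lecture).
vocabulary = ["hello", "help", "halo", "hero", "helio"]
index = build_permuterm(vocabulary)
print(rotate_query("hel*o"))            # o$hel
print(wildcard_lookup(index, "hel*o"))  # ['helio', 'hello']
```

Each returned term still needs its postings looked up, as the Query processing slide points out. The index also grows by one entry per character of every term (plus one for the $), which is the price of handling a * in any position.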