Tolerant IRThis lectureDictionary data structures for inverted indexesA naïve dictionaryDictionary data structuresHashtablesTree: binary treeTree: B-treeTreesWild-card queriesWild-card queries: *Query processingB-trees handle *’s at the end of a query termPermuterm indexPermuterm query processingBigram indexesBigram index exampleProcessing n-gram wild-cardsProcessing wild-card queriesAdvanced featuresSpelling correctionSpell correctionDocument correctionQuery mis-spellingsIsolated word correctionSlide 26Edit distanceWeighted edit distanceUsing edit distancesEdit distance to all dictionary terms?n-gram overlapExample with trigramsOne option – Jaccard coefficientMatching trigramsContext-sensitive spell correctionContext-sensitive correctionGeneral issues in spell correctionComputational costPowerPoint PresentationSoundexSoundex – typical algorithmSlide 45Soundex continuedExerciseSlide 48What queries can we process?Aside – results cachingPrasad L05TolerantIR 1Tolerant IRAdapted from Lectures byPrabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)2This lecture“Tolerant” retrievalWild-card queriesSpelling correctionSoundexMotivationFor expressiveness to accommodate variants (e.g., automat*)To deal with incomplete information about spelling or multiple spellings (E.g., S*dney)Dictionary data structures for inverted indexesThe dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?Sec. 3.13A naïve dictionaryAn array of struct: char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently?How do we quickly look up elements at query time?Sec. 3.14Dictionary data structuresTwo main choices:HashtablesTreesSome IR systems use hashtables, some treesSec. 3.15HashtablesEach vocabulary term is hashed to an integer(We assume you’ve seen hashtables before)Pros:Lookup is faster than for a tree: O(1)Cons:No easy way to find minor variants:judgment/judgementNo prefix search [tolerant retrieval]If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everythingSec. 3.16Roota-mn-za-hu hy-m n-sh si-zaardvarkhuygenssicklezygotTree: binary treeSec. 3.17Tree: B-treeDefinition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].a-huhy-mn-zSec. 3.18TreesSimplest: binary treeMore usual: B-treesTrees require a standard ordering of characters and hence strings … but we typically have onePros:Solves the prefix problem (terms starting with hyp)Cons:Slower: O(log M) [and this requires balanced tree]Rebalancing binary trees is expensiveBut B-trees mitigate the rebalancing problemSec. 3.1910Wild-card queries11Wild-card queries: *mon*: find all docs containing any word beginning with “mon”.Hashing unsuitable because order not preservedEasy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo*mon: find words ending in “mon”: harderMaintain an additional B-tree for terms backwards.Can retrieve all words in range: nom ≤ w < non.Exercise: from this, how can we enumerate all termsmeeting the wild-card query pro*cent ?12Query processingAt this point, we have an enumeration of all terms in the dictionary that match the wild-card query.We still have to look up the postings for each enumerated term.E.g., consider the query:se*ate AND fil*erThis may result in the execution of many Boolean AND queries.13B-trees handle *’s at the end of a query termHow can we handle *’s in the middle of query term? (Especially multiple *’s)Consider co*tionWe could look up co* AND *tion in a B-tree and intersect the two term setsExpensiveThe solution: transform every wild-card query so that the *’s occur at the endThis gives rise to the Permuterm Index.14Permuterm indexFor term hello index under:hello$, ello$h, llo$he, lo$hel, o$hellwhere $ is a special symbol.Queries: X lookup on X$ *X lookup on X$**X* lookup on X*X*Y lookup on Y$X* X*Y*Z ??? Query = hel*oX=hel, Y=oLookup o$hel*15Permuterm query processingRotate query wild-card to the rightNow use B-tree lookup as before.Permuterm problem: ≈ quadruples lexicon sizeEmpirical observation for English.16Bigram indexesEnumerate all k-grams (sequence of k chars) occurring in any terme.g., from text “April is the cruelest month” we get the 2-grams (bigrams)$ is a special word boundary symbolMaintain a second “inverted” index from bigrams to dictionary terms that match each bigram.$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$, $m,mo,on,nt,h$17Bigram index examplemoonamong$m maceamongamortizemaddenaroundThe k-gram index finds terms based on a query consisting of k-grams (here k=2).18Processing n-gram wild-cardsQuery mon* can now be run as$m AND mo AND onGets terms that match AND version of our wildcard query.But we’d incorrectly enumerate moon as well.Must post-filter these terms against query.Surviving enumerated terms are then looked up in the original term-document inverted index.Fast, space efficient (compared to permuterm).19Processing wild-card queriesAs before, we must execute a Boolean query for each enumerated, filtered term.Wild-cards can result in expensive query execution (very large disjunctions…)Avoid encouraging “laziness” in the UI: SearchType your search terms, use ‘*’ if you need to.E.g., Alex* will match Alexander.20Advanced featuresAvoiding UI clutter is one reason to hide advanced features behind an “Advanced Search” buttonIt also deters most users from unnecessarily hitting the engine with fancy queries21Spelling correction22Spell correctionTwo principal usesCorrecting document(s) being indexedRetrieving matching documents when query contains a spelling errorTwo main flavors:Isolated wordCheck each word on its own for misspellingWill not catch typos resulting in correctly spelled words e.g., from formContext-sensitiveLook at surrounding words, e.g., I flew form Heathrow to Narita.23Document correctionEspecially needed for OCR’ed documentsCorrection algorithms tuned for this: rn vs mCan use domain-specific knowledgeE.g., OCR can
View Full Document