Slide 1Recap of the previous lectureThis lectureDictionary data structures for inverted indexesA naïve dictionaryDictionary data structuresHashtablesTree: binary treeTree: B-treeTreesWild-card queriesWild-card queries: *Query processingB-trees handle *’s at the end of a query termPermuterm indexPermuterm query processingBigram (k-gram) indexesBigram index exampleProcessing wild-cardsProcessing wild-card queriesSpelling correctionSpell correctionDocument correctionQuery mis-spellingsIsolated word correctionSlide 26Edit distanceWeighted edit distanceUsing edit distancesEdit distance to all dictionary terms?n-gram overlapExample with trigramsOne option – Jaccard coefficientMatching trigramsContext-sensitive spell correctionContext-sensitive correctionExerciseAnother approachGeneral issues in spell correctionSoundexSlide 41Soundex – typical algorithmSlide 43Soundex continuedSlide 45What queries can we process?Slide 47ResourcesIntroduction to Information RetrievalIntroduction to Information Retrieval Introduction toInformation RetrievalCS276: Information Retrieval and Web SearchPandu Nayak and Prabhakar RaghavanLecture 3: Dictionaries and tolerant retrievalIntroduction to Information RetrievalIntroduction to Information Retrieval Recap of the previous lectureThe type/token distinctionTerms are normalized types put in the dictionaryTokenization problems:Hyphens, apostrophes, compounds, CJKTerm equivalence classing:Numbers, case folding, stemming, lemmatizationSkip pointersEncoding a tree-like structure in a postings listBiword indexes for phrasesPositional indexes for phrases/proximity queriesCh. 22Introduction to Information RetrievalIntroduction to Information Retrieval This lectureDictionary data structures“Tolerant” retrievalWild-card queriesSpelling correctionSoundexCh. 33Introduction to Information RetrievalIntroduction to Information Retrieval Dictionary data structures for inverted indexesThe dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?Sec. 3.14Introduction to Information RetrievalIntroduction to Information Retrieval A naïve dictionaryAn array of struct: char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently?How do we quickly look up elements at query time?Sec. 3.15Introduction to Information RetrievalIntroduction to Information Retrieval Dictionary data structuresTwo main choices:HashtablesTreesSome IR systems use hashtables, some treesSec. 3.16Introduction to Information RetrievalIntroduction to Information Retrieval HashtablesEach vocabulary term is hashed to an integer(We assume you’ve seen hashtables before)Pros:Lookup is faster than for a tree: O(1)Cons:No easy way to find minor variants:judgment/judgementNo prefix search [tolerant retrieval]If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everythingSec. 3.17Introduction to Information RetrievalIntroduction to Information Retrieval Roota-mn-za-hu hy-m n-sh si-zaardvarkhuygenssicklezygotTree: binary treeSec. 3.18Introduction to Information RetrievalIntroduction to Information Retrieval Tree: B-treeDefinition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].a-huhy-mn-zSec. 3.19Introduction to Information RetrievalIntroduction to Information Retrieval TreesSimplest: binary treeMore usual: B-treesTrees require a standard ordering of characters and hence strings … but we typically have onePros:Solves the prefix problem (terms starting with hyp)Cons:Slower: O(log M) [and this requires balanced tree]Rebalancing binary trees is expensiveBut B-trees mitigate the rebalancing problemSec. 3.110Introduction to Information RetrievalIntroduction to Information Retrieval WILD-CARD QUERIES11Introduction to Information RetrievalIntroduction to Information Retrieval Wild-card queries: *mon*: find all docs containing any word beginning with “mon”.Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo*mon: find words ending in “mon”: harderMaintain an additional B-tree for terms backwards.Can retrieve all words in range: nom ≤ w < non.Exercise: from this, how can we enumerate all termsmeeting the wild-card query pro*cent ?Sec. 3.212Introduction to Information RetrievalIntroduction to Information Retrieval Query processingAt this point, we have an enumeration of all terms in the dictionary that match the wild-card query.We still have to look up the postings for each enumerated term.E.g., consider the query:se*ate AND fil*erThis may result in the execution of many Boolean AND queries.Sec. 3.213Introduction to Information RetrievalIntroduction to Information Retrieval B-trees handle *’s at the end of a query termHow can we handle *’s in the middle of query term?co*tionWe could look up co* AND *tion in a B-tree and intersect the two term setsExpensiveThe solution: transform wild-card queries so that the *’s occur at the endThis gives rise to the Permuterm Index.Sec. 3.214Introduction to Information RetrievalIntroduction to Information Retrieval Permuterm indexFor term hello, index under:hello$, ello$h, llo$he, lo$hel, o$hellwhere $ is a special symbol.Queries:X lookup on X$ X* lookup on $X**X lookup on X$* *X* lookup on X*X*Y lookup on Y$X* X*Y*Z ??? Exercise!Query = hel*oX=hel, Y=oLookup o$hel*Sec. 3.2.115Introduction to Information RetrievalIntroduction to Information Retrieval Permuterm query processingRotate query wild-card to the rightNow use B-tree lookup as before.Permuterm problem: ≈ quadruples lexicon sizeEmpirical observation for English.Sec. 3.2.116Introduction to Information RetrievalIntroduction to Information Retrieval Bigram (k-gram) indexesEnumerate all k-grams (sequence of k chars) occurring in any terme.g., from text “April is the cruelest month” we get the 2-grams (bigrams)$ is a special word boundary symbolMaintain a second inverted index from bigrams to dictionary terms that match each bigram.$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$, $m,mo,on,nt,h$Sec. 3.2.217Introduction to
View Full Document