Wright CS 707 - Tolerant IR - D2172432

Home> Schools> Wright State University> Counseling (CS) > CS 707> Tolerant IR

Wright CS 707 - Tolerant IR

Pages 47

Download Save

Unformatted text preview:

Tolerant IRThis lectureDictionary data structures for inverted indexesA naïve dictionaryDictionary data structuresHashtablesTree: binary treeTree: B-treeTreesWild-card queriesWild-card queries: *Query processingB-trees handle *’s at the end of a query termPermuterm indexPermuterm query processingBigram indexesBigram index exampleProcessing n-gram wild-cardsProcessing wild-card queriesAdvanced featuresSpelling correctionSpell correctionDocument correctionQuery mis-spellingsIsolated word correctionSlide 26Edit distanceWeighted edit distanceUsing edit distancesEdit distance to all dictionary terms?n-gram overlapExample with trigramsOne option – Jaccard coefficientMatching trigramsContext-sensitive spell correctionContext-sensitive correctionGeneral issues in spell correctionComputational costPowerPoint PresentationSoundexSoundex – typical algorithmSlide 45Soundex continuedExerciseSlide 48What queries can we process?Aside – results cachingPrasad L05TolerantIR 1Tolerant IRAdapted from Lectures byPrabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)2This lecture“Tolerant” retrievalWild-card queriesSpelling correctionSoundexMotivationFor expressiveness to accommodate variants (e.g., automat*)To deal with incomplete information about spelling or multiple spellings (E.g., S*dney)Dictionary data structures for inverted indexesThe dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?Sec. 3.13A naïve dictionaryAn array of struct: char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently?How do we quickly look up elements at query time?Sec. 3.14Dictionary data structuresTwo main choices:HashtablesTreesSome IR systems use hashtables, some treesSec. 3.15HashtablesEach vocabulary term is hashed to an integer(We assume you’ve seen hashtables before)Pros:Lookup is faster than for a tree: O(1)Cons:No easy way to find minor variants:judgment/judgementNo prefix search [tolerant retrieval]If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everythingSec. 3.16Roota-mn-za-hu hy-m n-sh si-zaardvarkhuygenssicklezygotTree: binary treeSec. 3.17Tree: B-treeDefinition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].a-huhy-mn-zSec. 3.18TreesSimplest: binary treeMore usual: B-treesTrees require a standard ordering of characters and hence strings … but we typically have onePros:Solves the prefix problem (terms starting with hyp)Cons:Slower: O(log M) [and this requires balanced tree]Rebalancing binary trees is expensiveBut B-trees mitigate the rebalancing problemSec. 3.1910Wild-card queries11Wild-card queries: *mon*: find all docs containing any word beginning with “mon”.Hashing unsuitable because order not preservedEasy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo*mon: find words ending in “mon”: harderMaintain an additional B-tree for terms backwards.Can retrieve all words in range: nom ≤ w < non.Exercise: from this, how can we enumerate all termsmeeting the wild-card query pro*cent ?12Query processingAt this point, we have an enumeration of all terms in the dictionary that match the wild-card query.We still have to look up the postings for each enumerated term.E.g., consider the query:se*ate AND fil*erThis may result in the execution of many Boolean AND queries.13B-trees handle *’s at the end of a query termHow can we handle *’s in the middle of query term? (Especially multiple *’s)Consider co*tionWe could look up co* AND *tion in a B-tree and intersect the two term setsExpensiveThe solution: transform every wild-card query so that the *’s occur at the endThis gives rise to the Permuterm Index.14Permuterm indexFor term hello index under:hello$, ello$h, llo$he, lo$hel, o$hellwhere $ is a special symbol.Queries: X lookup on X$ *X lookup on X$**X* lookup on X*X*Y lookup on Y$X* X*Y*Z ??? Query = hel*oX=hel, Y=oLookup o$hel*15Permuterm query processingRotate query wild-card to the rightNow use B-tree lookup as before.Permuterm problem: ≈ quadruples lexicon sizeEmpirical observation for English.16Bigram indexesEnumerate all k-grams (sequence of k chars) occurring in any terme.g., from text “April is the cruelest month” we get the 2-grams (bigrams)$ is a special word boundary symbolMaintain a second “inverted” index from bigrams to dictionary terms that match each bigram.$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$, $m,mo,on,nt,h$17Bigram index examplemoonamong$m maceamongamortizemaddenaroundThe k-gram index finds terms based on a query consisting of k-grams (here k=2).18Processing n-gram wild-cardsQuery mon* can now be run as$m AND mo AND onGets terms that match AND version of our wildcard query.But we’d incorrectly enumerate moon as well.Must post-filter these terms against query.Surviving enumerated terms are then looked up in the original term-document inverted index.Fast, space efficient (compared to permuterm).19Processing wild-card queriesAs before, we must execute a Boolean query for each enumerated, filtered term.Wild-cards can result in expensive query execution (very large disjunctions…)Avoid encouraging “laziness” in the UI: SearchType your search terms, use ‘*’ if you need to.E.g., Alex* will match Alexander.20Advanced featuresAvoiding UI clutter is one reason to hide advanced features behind an “Advanced Search” buttonIt also deters most users from unnecessarily hitting the engine with fancy queries21Spelling correction22Spell correctionTwo principal usesCorrecting document(s) being indexedRetrieving matching documents when query contains a spelling errorTwo main flavors:Isolated wordCheck each word on its own for misspellingWill not catch typos resulting in correctly spelled words e.g., from  formContext-sensitiveLook at surrounding words, e.g., I flew form Heathrow to Narita.23Document correctionEspecially needed for OCR’ed documentsCorrection algorithms tuned for this: rn vs mCan use domain-specific knowledgeE.g., OCR can

View Full Document


School:
Email:
New Password:
Confirm Password:

Wright CS 707 - Tolerant IR

Sign up for free to view:

Please select your school