Stanford CS 276 - Dictionaries and tolerant retrieval - D2388102

Home> Schools> Stanford University> Computer Science (CS) > CS 276> Dictionaries and tolerant retrieval

DOC PREVIEW

Stanford CS 276 - Dictionaries and tolerant retrieval

School name Stanford University

Course Cs 276- Information Retrieval and Web Search

Pages 48

This preview shows page 1-2-3-23-24-25-26-46-47-48 out of 48 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 48 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1Recap of the previous lectureThis lectureDictionary data structures for inverted indexesA naïve dictionaryDictionary data structuresHashtablesTree: binary treeTree: B-treeTreesWild-card queriesWild-card queries: *Query processingB-trees handle *’s at the end of a query termPermuterm indexPermuterm query processingBigram (k-gram) indexesBigram index exampleProcessing wild-cardsProcessing wild-card queriesSpelling correctionSpell correctionDocument correctionQuery mis-spellingsIsolated word correctionSlide 26Edit distanceWeighted edit distanceUsing edit distancesEdit distance to all dictionary terms?n-gram overlapExample with trigramsOne option – Jaccard coefficientMatching trigramsContext-sensitive spell correctionContext-sensitive correctionExerciseAnother approachGeneral issues in spell correctionSoundexSlide 41Soundex – typical algorithmSlide 43Soundex continuedSlide 45What queries can we process?Slide 47ResourcesIntroduction to Information RetrievalIntroduction to Information Retrieval Introduction toInformation RetrievalCS276: Information Retrieval and Web SearchPandu Nayak and Prabhakar RaghavanLecture 3: Dictionaries and tolerant retrievalIntroduction to Information RetrievalIntroduction to Information Retrieval Recap of the previous lectureThe type/token distinctionTerms are normalized types put in the dictionaryTokenization problems:Hyphens, apostrophes, compounds, CJKTerm equivalence classing:Numbers, case folding, stemming, lemmatizationSkip pointersEncoding a tree-like structure in a postings listBiword indexes for phrasesPositional indexes for phrases/proximity queriesCh. 22Introduction to Information RetrievalIntroduction to Information Retrieval This lectureDictionary data structures“Tolerant” retrievalWild-card queriesSpelling correctionSoundexCh. 33Introduction to Information RetrievalIntroduction to Information Retrieval Dictionary data structures for inverted indexesThe dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?Sec. 3.14Introduction to Information RetrievalIntroduction to Information Retrieval A naïve dictionaryAn array of struct: char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently?How do we quickly look up elements at query time?Sec. 3.15Introduction to Information RetrievalIntroduction to Information Retrieval Dictionary data structuresTwo main choices:HashtablesTreesSome IR systems use hashtables, some treesSec. 3.16Introduction to Information RetrievalIntroduction to Information Retrieval HashtablesEach vocabulary term is hashed to an integer(We assume you’ve seen hashtables before)Pros:Lookup is faster than for a tree: O(1)Cons:No easy way to find minor variants:judgment/judgementNo prefix search [tolerant retrieval]If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everythingSec. 3.17Introduction to Information RetrievalIntroduction to Information Retrieval Roota-mn-za-hu hy-m n-sh si-zaardvarkhuygenssicklezygotTree: binary treeSec. 3.18Introduction to Information RetrievalIntroduction to Information Retrieval Tree: B-treeDefinition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].a-huhy-mn-zSec. 3.19Introduction to Information RetrievalIntroduction to Information Retrieval TreesSimplest: binary treeMore usual: B-treesTrees require a standard ordering of characters and hence strings … but we typically have onePros:Solves the prefix problem (terms starting with hyp)Cons:Slower: O(log M) [and this requires balanced tree]Rebalancing binary trees is expensiveBut B-trees mitigate the rebalancing problemSec. 3.110Introduction to Information RetrievalIntroduction to Information Retrieval WILD-CARD QUERIES11Introduction to Information RetrievalIntroduction to Information Retrieval Wild-card queries: *mon*: find all docs containing any word beginning with “mon”.Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo*mon: find words ending in “mon”: harderMaintain an additional B-tree for terms backwards.Can retrieve all words in range: nom ≤ w < non.Exercise: from this, how can we enumerate all termsmeeting the wild-card query pro*cent ?Sec. 3.212Introduction to Information RetrievalIntroduction to Information Retrieval Query processingAt this point, we have an enumeration of all terms in the dictionary that match the wild-card query.We still have to look up the postings for each enumerated term.E.g., consider the query:se*ate AND fil*erThis may result in the execution of many Boolean AND queries.Sec. 3.213Introduction to Information RetrievalIntroduction to Information Retrieval B-trees handle *’s at the end of a query termHow can we handle *’s in the middle of query term?co*tionWe could look up co* AND *tion in a B-tree and intersect the two term setsExpensiveThe solution: transform wild-card queries so that the *’s occur at the endThis gives rise to the Permuterm Index.Sec. 3.214Introduction to Information RetrievalIntroduction to Information Retrieval Permuterm indexFor term hello, index under:hello$, ello$h, llo$he, lo$hel, o$hellwhere $ is a special symbol.Queries:X lookup on X$ X* lookup on $X**X lookup on X$* *X* lookup on X*X*Y lookup on Y$X* X*Y*Z ??? Exercise!Query = hel*oX=hel, Y=oLookup o$hel*Sec. 3.2.115Introduction to Information RetrievalIntroduction to Information Retrieval Permuterm query processingRotate query wild-card to the rightNow use B-tree lookup as before.Permuterm problem: ≈ quadruples lexicon sizeEmpirical observation for English.Sec. 3.2.116Introduction to Information RetrievalIntroduction to Information Retrieval Bigram (k-gram) indexesEnumerate all k-grams (sequence of k chars) occurring in any terme.g., from text “April is the cruelest month” we get the 2-grams (bigrams)$ is a special word boundary symbolMaintain a second inverted index from bigrams to dictionary terms that match each bigram.$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$, $m,mo,on,nt,h$Sec. 3.2.217Introduction to

View Full Document