CS6322: Information Retrieval
Sanda Harabagiu
Lecture 3: Dictionaries and tolerant retrieval

Recap of the previous lecture (Ch. 2)
- The type/token distinction
  - Terms are normalized types put in the dictionary
- Tokenization problems: hyphens, apostrophes, compounds, Chinese
- Term equivalence classing: numbers, case folding, stemming, lemmatization
- Skip pointers
  - Encoding a tree-like structure in a postings list
- Biword indexes for phrases
- Positional indexes for phrases/proximity queries

This lecture (Ch. 3)
- Dictionary data structures
- “Tolerant” retrieval
  - Wild-card queries
  - Spelling correction
  - Soundex

Dictionary data structures for inverted indexes (Sec. 3.1)
- The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list … in what data structure?

A naïve dictionary (Sec. 3.1)
- An array of struct:
    char[20]    int          Postings *
    20 bytes    4/8 bytes    4/8 bytes
- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?

Dictionary data structures (Sec. 3.1)
- Two main choices:
  - Hash table
  - Tree
- Some IR systems use hashes, some use trees

Hashes (Sec. 3.1)
- Each vocabulary term is hashed to an integer
  - (We assume you’ve seen hash tables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
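As a minimal sketch of the hash-table trade-off, the following uses a Python dict (itself a hash table) as the dictionary. The terms, document frequencies, postings, and function names are illustrative assumptions, not data from the lecture.

```python
# Hypothetical toy dictionary: term -> (document frequency, postings list).
# A Python dict is a hash table, so exact lookup is O(1) on average.
dictionary = {
    "judgment": (2, [1, 7]),
    "judgement": (1, [4]),
    "hyperbole": (1, [3]),
    "hypothesis": (2, [2, 9]),
}


def lookup(term):
    """Exact-match lookup; returns None if the term is absent."""
    return dictionary.get(term)


def prefix_scan(prefix):
    """Hashing does not preserve term order, so a prefix query
    has to test every key in the vocabulary: O(M)."""
    return sorted(t for t in dictionary if t.startswith(prefix))
```

Exact lookup of judgment says nothing about the variant judgement, and a query such as hyp* can only be answered by the full scan in prefix_scan.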
Tree: binary tree (Sec. 3.1)
[Figure: binary tree over the dictionary. The root splits into a-m and n-z; a-m splits into a-hu and hy-m; n-z splits into n-sh and si-z. Example leaves: aardvark, huygens, sickle, zygot.]

Tree: B-tree (Sec. 3.1)
- Definition: Every internal node has a number of children in the interval [a,b], where a, b are appropriate natural numbers, e.g., [2,4].
[Figure: B-tree node with children a-hu, hy-m, n-z.]

Trees (Sec. 3.1)
- Simplest: binary tree
- More usual: B-trees
- Trees require a standard ordering of characters and hence strings … but we standardly have one
- Pros:
  - Solves the prefix problem (terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires a balanced tree]
  - Rebalancing binary trees is expensive
    - But B-trees mitigate the rebalancing problem

WILD-CARD QUERIES

Wild-card queries: * (Sec. 3.2)
- mon*: find all docs containing any word beginning with “mon”.
- Easy with a binary tree (or B-tree) lexicon: retrieve all words in the range mon ≤ w < moo
- *mon: find words ending in “mon”: harder
  - Maintain an additional B-tree for terms written backwards.
  - Can retrieve all words in the range nom ≤ w < non.
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Query processing (Sec. 3.2)
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
- We still have to look up the postings for each enumerated term.
- E.g., consider the query: se*ate AND fil*er
- This may result in the execution of many Boolean AND queries.

B-trees handle *’s at the end of a query term (Sec. 3.2)
- How can we handle *’s in the middle of a query term?
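As a minimal sketch of the range lookups used for mon* and *mon, the following uses a sorted Python list with the standard bisect module as a stand-in for the B-tree. The vocabulary and the function names are illustrative assumptions, and the prefix is assumed to be non-empty.

```python
import bisect

# Hypothetical toy vocabulary; a sorted list stands in for the B-tree.
terms = sorted(["demon", "lemon", "money", "monk", "month", "moon", "salmon"])
# Second structure: the same terms written backwards, also kept sorted.
reversed_terms = sorted(t[::-1] for t in terms)


def prefix_range(sorted_terms, prefix):
    """Return all w with prefix <= w < upper, e.g. mon <= w < moo."""
    # Bump the last character to get the exclusive upper bound: "mon" -> "moo".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def trailing_wildcard(prefix):
    """mon*: range lookup in the forward structure."""
    return prefix_range(terms, prefix)


def leading_wildcard(suffix):
    """*mon: range lookup for 'nom' in the backwards structure."""
    hits = prefix_range(reversed_terms, suffix[::-1])
    return sorted(w[::-1] for w in hits)
```

With this vocabulary, trailing_wildcard("mon") returns money, monk, month (moon falls outside the range), and leading_wildcard("mon") returns demon, lemon, salmon.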
- co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets
  - Expensive
- The solution: transform wild-card queries so that the *’s occur at the end
- This gives rise to the Permuterm Index.

Permuterm index (Sec. 3.2.1)
- For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
- Queries:
    X        lookup on X$
    X*       lookup on $X*
    *X       lookup on X$*
    *X*      lookup on X*
    X*Y      lookup on Y$X*
    X*Y*Z    ??? Exercise!
- Example: Query = hel*o, so X = hel, Y = o; lookup o$hel*

Permuterm query processing (Sec. 3.2.1)
- Rotate query wild-card to the right
- Now use B-tree lookup as before.
- Permuterm problem: ≈ quadruples lexicon size
  - Empirical observation for English.

Bigram (k-gram) indexes (Sec. 3.2.2)
- Enumerate all k-grams (sequence of k chars) occurring in any term
- e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)
    $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,th,h$
  - $ is a special word boundary symbol
- Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

Bigram index example (Sec. 3.2.2)
- The k-gram index finds terms based on a query consisting of k-grams (here k=2).
[Figure: postings for the bigrams $m, mo, and on, pointing to dictionary terms such as mace, madden, among, amortize, around.]

Processing wild-cards (Sec. 3.2.2)
- Query mon* can now be run as $m AND mo AND on
- Gets terms that match the AND version of our wildcard query.
- But we’d enumerate moon.
- Must post-filter these terms against the query.
- Surviving enumerated terms are then looked up in the term-document inverted index.
- Fast, space efficient (compared to permuterm).

Processing wild-card queries (Sec. 3.2.2)
- As before, we must execute a Boolean query for each enumerated, filtered term.
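The permuterm rotation and the bigram lookup with post-filtering described in the slides above can be sketched together as follows. This is a minimal illustration under stated assumptions: the vocabulary and all function names are hypothetical, a plain dict scanned by prefix stands in for the B-tree, permuterm_query handles only queries with exactly one *, and kgram_query handles only trailing-wildcard queries with a non-empty prefix.

```python
from collections import defaultdict

# Hypothetical toy vocabulary, not taken from the lecture.
VOCAB = ["hello", "help", "halo", "moon", "month", "money", "among"]


def rotations(term):
    """All rotations of term + '$': hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


# Permuterm index: every rotation points back to its term.
permuterm = {rot: t for t in VOCAB for rot in rotations(t)}


def permuterm_query(query):
    """Single-* query X*Y: rotate to Y$X, then do a prefix lookup."""
    x, y = query.split("*")
    key = y + "$" + x
    # A linear scan replaces the B-tree prefix lookup in this sketch.
    return sorted({t for rot, t in permuterm.items() if rot.startswith(key)})


def kgrams(term, k=2):
    """Set of k-grams of $term$, with $ as the boundary symbol."""
    s = "$" + term + "$"
    return {s[i:i + k] for i in range(len(s) - k + 1)}


# k-gram index: bigram -> set of dictionary terms containing it.
kgram_index = defaultdict(set)
for t in VOCAB:
    for g in kgrams(t):
        kgram_index[g].add(t)


def kgram_query(query, k=2):
    """Trailing-wildcard query X*: AND the k-grams of $X, then post-filter.

    Returns (candidates before filtering, matches after filtering).
    """
    prefix = query.rstrip("*")
    s = "$" + prefix
    grams = [s[i:i + k] for i in range(len(s) - k + 1)]
    candidates = set.intersection(*(kgram_index.get(g, set()) for g in grams))
    matches = sorted(t for t in candidates if t.startswith(prefix))
    return sorted(candidates), matches
```

With this toy vocabulary, permuterm_query("hel*o") looks up o$hel* and returns hello. kgram_query("mon*") intersects $m, mo, and on to get the candidates money, month, moon, and the post-filter then drops moon.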
- Wild-cards can result in expensive query execution (very large disjunctions…)
  - pyth* AND prog*
- If you encourage “laziness” people will respond!
- Which web search engines allow wildcard queries?
[Figure: search box. Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.]

SPELLING CORRECTION

Spell correction
- Two principal uses
  - Correcting document(s) being indexed