Introduc)on*to*Informa)on*Retrieval*Introduc)on*to*Informa(on)Retrieval)CS276:*Informa)on*Retrieval*and*Web*Search*Pandu*Nayak*and*Prabhakar*Raghavan*Lecture*3:*Dic)onaries*and*tolerant*retrieval*Introduc)on*to*Informa)on*Retrieval*Recap*of*the*previous*lecture* The*type/token*dis)nc)on* Terms*are*normalized*types*put*in*the*dic)onary* Tokeniza)on*problems:* Hyphens,*apostrophes,*compounds,*CJK* Term*equivalence*classing:* Numbers,*case*folding,*stemming,*lemma)za)on* Skip*pointers* Encoding*a*tree‐like*structure*in*a*pos)ngs*list* Biword*indexes*for*phrases* Posi)onal*indexes*for*phrases/proximity*queries*Ch. 2 2*Introduc)on*to*Informa)on*Retrieval*This*lecture* Dic)onary*data*structures* “Tolerant”*retrieval* Wild‐card*queries* Spelling*correc)on* Soundex*Ch. 3 3*Introduc)on*to*Informa)on*Retrieval*Dic)onary*data*structures*for*inverted*indexes* The*dic)onary*data*structure*stores*the*term*vocabulary,*document*frequency,*pointers*to*each*pos)ngs*list*…*in*what*data*structure?*Sec. 3.1 4*Introduc)on*to*Informa)on*Retrieval*A*naïve*dic)onary* An*array*of*struct:**********char[20]***int*******************Pos)ngs************20*bytes***4/8*bytes********4/8*by tes*** How*do*we*store*a*dic)onary*in*memory*efficiently?* How*do*we*quickly*look*up*elements*at*query*)me?*Sec. 3.1 5*Introduc)on*to*Informa)on*Retrieval*Dic)onary*data*structures* Two*main*choices:* Hashtables* Trees* Some*IR*systems*use*hashtables,*some*trees*Sec. 3.1 6*Introduc)on*to*Informa)on*Retrieval*Hashtables* Each*vocabulary*term*is*hashed*to*an*integer* (We*assume*you’ve*seen*hashtables*before)* Pros:* Lookup*is*faster*than*for*a*tree:*O(1)* Cons:* No*easy*way*to*find*minor*variants:* judgment/judgement* No*prefix*search * *[tolerant**retrieval]* If*vocabulary*keeps*growing,*need*to*occasionally*do*the*expensive*opera)on*of*rehashing*everything*Sec. 3.1 7*Introduc)on*to*Informa)on*Retrieval*Root a-m n-z a-hu hy-m n-sh si-z Tree:*binary*tree*Sec. 3.1 8*Introduc)on*to*Informa)on*Retrieval*Tree:*B‐tree* Defini)on:*Every*internal*nodel*has*a*number*of*children*in*the*interval*[a,b]*where*a,*b*are*appropriate*natural*numbers,*e.g.,*[2,4].*a-hu hy-m n-z Sec. 3.1 9*Introduc)on*to*Informa)on*Retrieval*Trees* Simplest:*binary*tree* More*usual:*B‐trees* Trees*require*a*standard*ordering*of*characters*and*hence*strings*…*but*we*typically*have*one* Pros:* Solves*the*prefix*problem*(terms*star)ng*with*hyp)* Cons:* Slower:*O(log*M)**[and*this*requires*balanced*tree]* Rebalancing*binary*trees*is*expensive* But*B‐trees*mi)gate*the*rebalancing*problem*Sec. 3.1 10*Introduc)on*to*Informa)on*Retrieval*WILD‐CARD)QUERIES)11*Introduc)on*to*Informa)on*Retrieval*Wild‐card*queries:*** mon*:*find*all*docs*containing*any*word*beginning*with*“mon”.* Easy*with*binary*tree*(or*B‐tree)*lexicon:*retrieve*all*words*in*range:*mon&≤&w&<&moo& *mon:&find*words*ending*in*“mon”:*harder* Maintain*an*addi)onal*B‐tree*for*terms*backwards.*Can*retrieve*all*words*in*range:*nom&≤&w&<&non.*Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent ? Sec. 3.2 12*Introduc)on*to*Informa)on*Retrieval*Query*processing* At*this*point,*we*have*an*enumera)on*of*all*terms*in*the*dic)onary*that*match*the*wild‐card*query.* We*s)ll*have*to*look*up*the*pos)ngs*for*each*enumerated*term.* E.g.,*consider*the*query:**se*ate*AND*fil*er&*This*may*result*in*the*execu)on*of*many*Boolean*AND*queries.*Sec. 3.2 13*Introduc)on*to*Informa)on*Retrieval*B‐trees*handle**’s*at*the*end*of*a*query*term* How*can*we*handle**’s*in*the*middle*of*query*term?* co*2on& We*could*look*up*co**AND**2on*in*a*B‐tree*and*intersect*the*two*term*sets* Expensive* The*solu)on:*transform*wild‐card*queries*so*that*the**’s*occur*at*the*end* This*gives*rise*to*the*Permuterm *Index.*Sec. 3.2 14*Introduc)on*to*Informa)on*Retrieval*Permuterm*index* For*term*hello,*index*under:* hello$,&ello$h,&llo$he,&lo$hel,&o$hell&where*$*is*a*special*symbol.* Queries:* X****lookup*on*X$ )))))X*)))lookup*on***$X*) *X)))lookup*on*X$* ******X***lookup*on***X*) X*Y*lookup*on*Y$X* *****X*Y*Z*****???*Exercise!*Query = hel*o X=hel, Y=o Lookup o$hel* Sec. 3.2.1 15*Introduc)on*to*Informa)on*Retrieval*Permuterm*query*processing* Rotate*quer y*wild‐card*to*the*right* Now*use*B‐tree*lookup*as*before.* Permuterm*problem:*≈*quadruples*lexicon*size*Empirical observation for English. Sec. 3.2.1 16*Introduc)on*to*Informa)on*Retrieval*Bigram*(k‐gram)*indexes* Enumerate*all*k‐grams*(sequence*of*k*chars)*occurring*in*any*term* e.g.,*from*text*“April&is&the&cruelest&month”*we*get*the*2‐grams*(bigrams)* $*is*a*special*word*boundary*symbol* Maintain*a*second*inverted*index*from*bigrams*to*dic)onary*terms*that*match*each*bigram.*$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru, ue,el,le,es,st,t$, $m,mo,on,nt,h$ Sec. 3.2.2 17*Introduc)on*to*Informa)on*Retrieval*Bigram*index*example* The*k‐gram*index*finds*terms*based*on*a*query*consis)ng*of*k‐grams*(here*k=2).*mo on among $m mace along amortize madden among Sec. 3.2.2 18*Introduc)on*to*Informa)on*Retrieval*Processing*wild‐cards* Query*mon**can*now*be*run*as* $m&AND&mo&AND&on& Gets*terms*that*match*AND*version*of*our*wildcard*query.* But*we’d*enumerate*moon.* Must*post‐filter*these*terms*against*query.* Sur viving*enumerated*terms*are*then*looked*up*in*the*term‐document*inverted*index.* Fast,*space*efficient*(compared*to*permuterm).*Sec. 3.2.2 19*Introduc)on*to*Informa)on*Retrieval*Processing*wild‐card*queries* As*before,*we*must*execute*a*Boolean*query*for*each*enumerated,*filtered*term.* Wild‐cards*can*result*in*expensive*query*execu)on*(very*large*disjunc)ons…)* pyth**AND*prog** If*you*encourage*“laziness”*people*will*respond!* Which*web*search*engines*allow*wildcard*queries?*Search Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander. Sec. 3.2.2
View Full Document