UT Dallas CS 6359 - Lecture3


CS6322: Information Retrieval
Sanda Harabagiu
Lecture 3: Dictionaries and tolerant retrieval

Recap of the previous lecture (Ch. 2)
- The type/token distinction
  - Terms are normalized types put in the dictionary
- Tokenization problems: hyphens, apostrophes, compounds, Chinese
- Term equivalence classing: numbers, case folding, stemming, lemmatization
- Skip pointers
  - Encoding a tree-like structure in a postings list
- Biword indexes for phrases
- Positional indexes for phrases/proximity queries

This lecture (Ch. 3)
- Dictionary data structures
- “Tolerant” retrieval
  - Wild-card queries
  - Spelling correction
  - Soundex

Dictionary data structures for inverted indexes (Sec. 3.1)
- The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list … in what data structure?

A naïve dictionary (Sec. 3.1)
- An array of struct:

  Field type    Size
  char[20]      20 bytes
  int           4/8 bytes
  Postings *    4/8 bytes

- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?

Dictionary data structures (Sec. 3.1)
- Two main choices:
  - Hash table
  - Tree
- Some IR systems use hashes, some trees

Hashes (Sec. 3.1)
- Each vocabulary term is hashed to an integer
  - (We assume you’ve seen hashtables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
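
The trade-off on the Hashes slide can be illustrated with a minimal sketch. It assumes a Python dict as the hash table; the vocabulary, document frequencies, and postings lists are made up, and the function names `lookup` and `prefix_scan` are hypothetical.

```python
# Sketch only: a Python dict stands in for the hash-table dictionary.
# Each entry maps term -> (document frequency, postings list).
# The terms, frequencies, and postings below are invented for illustration.
dictionary = {
    "judgment": (2, [1, 4]),
    "judgement": (1, [7]),
    "hyphen": (1, [3]),
    "hypothesis": (2, [2, 5]),
}

def lookup(term):
    """Exact-match lookup: expected O(1) through hashing."""
    return dictionary.get(term)

def prefix_scan(prefix):
    """A hash table offers no prefix search, so every key must be scanned: O(M)."""
    return sorted(t for t in dictionary if t.startswith(prefix))

print(lookup("judgment"))    # exact term is found directly
print(lookup("judgmnt"))     # a minor variant simply misses
print(prefix_scan("hyp"))    # works, but only by touching all M terms
```

Hashing places "judgment" and "judgement" in unrelated slots, which is why the variant and prefix cases fall back to a scan of the whole vocabulary.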
Tree: binary tree (Sec. 3.1)
[Figure: binary tree over the vocabulary. The root splits into a-m and n-z; these split into a-hu, hy-m, n-sh, and si-z; leaves shown: aardvark, huygens, sickle, zygot.]

Tree: B-tree (Sec. 3.1)
- Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
[Figure: B-tree node with children a-hu, hy-m, n-z.]

Trees (Sec. 3.1)
- Simplest: binary tree
- More usual: B-trees
- Trees require a standard ordering of characters and hence strings … but we standardly have one
- Pros:
  - Solves the prefix problem (terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires balanced tree]
  - Rebalancing binary trees is expensive
    - But B-trees mitigate the rebalancing problem

WILD-CARD QUERIES

Wild-card queries: * (Sec. 3.2)
- mon*: find all docs containing any word beginning “mon”.
  - Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
- *mon: find words ending in “mon”: harder
  - Maintain an additional B-tree for terms backwards.
  - Can retrieve all words in range: nom ≤ w < non.
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Query processing (Sec. 3.2)
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
- We still have to look up the postings for each enumerated term.
- E.g., consider the query: se*ate AND fil*er
  - This may result in the execution of many Boolean AND queries.
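
The two range lookups from the wild-card slide (mon ≤ w < moo for mon*, and nom ≤ w < non over reversed terms for *mon) can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: a sorted Python list with binary search stands in for the B-tree, the vocabulary is invented, and the helper names are hypothetical.

```python
from bisect import bisect_left

# Invented toy vocabulary; sorted lists stand in for the forward and backward B-trees.
terms = sorted(["demon", "lemon", "moat", "monday", "money", "monk", "moon", "salmon"])
reversed_terms = sorted(t[::-1] for t in terms)

def next_prefix(prefix):
    """Upper bound of the range: bump the last character, e.g. 'mon' -> 'moo'.
    Assumes a non-empty prefix whose last character is not the maximum code point."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def prefix_range(sorted_terms, prefix):
    """All w with prefix <= w < next_prefix(prefix), found by two binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, next_prefix(prefix))
    return sorted_terms[lo:hi]

def trailing_wildcard(prefix):
    """mon* : range lookup in the forward ordering."""
    return prefix_range(terms, prefix)

def leading_wildcard(suffix):
    """*mon : range lookup on reversed terms (nom <= w < non), then reverse back."""
    hits = prefix_range(reversed_terms, suffix[::-1])
    return sorted(t[::-1] for t in hits)

print(trailing_wildcard("mon"))
print(leading_wildcard("mon"))
```

Each enumerated term would still need its postings list looked up, as the Query processing slide notes.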
B-trees handle *’s at the end of a query term (Sec. 3.2)
- How can we handle *’s in the middle of query term?
  - co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets
  - Expensive
- The solution: transform wild-card queries so that the *’s occur at the end
- This gives rise to the Permuterm Index.

Permuterm index (Sec. 3.2.1)
- For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
- Queries:

  Query    Lookup on
  X        X$
  X*       $X*
  *X       X$*
  *X*      X*
  X*Y      Y$X*
  X*Y*Z    ??? Exercise!

- Example: Query = hel*o, so X=hel, Y=o. Lookup o$hel*

Permuterm query processing (Sec. 3.2.1)
- Rotate query wild-card to the right
- Now use B-tree lookup as before.
- Permuterm problem: ≈ quadruples lexicon size (empirical observation for English).

Bigram (k-gram) indexes (Sec. 3.2.2)
- Enumerate all k-grams (sequence of k chars) occurring in any term
- e.g., from text “April is the cruelest month” we get the 2-grams (bigrams):
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, th, h$
  - $ is a special word boundary symbol
- Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

Bigram index example (Sec. 3.2.2)
- The k-gram index finds terms based on a query consisting of k-grams (here k=2).
[Figure: example postings lists in the bigram index for the bigrams $m, mo, and on; terms shown: mace, madden, among, amortize, around.]
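
The permuterm scheme from Sec. 3.2.1 above can be sketched in a few lines. This is a hedged illustration: a sorted list of (rotation, term) pairs stands in for the B-tree, only queries with exactly one * are handled, and the vocabulary and function names are hypothetical.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$': hello -> hello$, ello$h, llo$he, lo$hel, o$hell, $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Sorted (rotation, original term) pairs; a stand-in for the permuterm B-tree."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def rotate_query(query):
    """Rotate a single-* query so the * ends up at the end: X*Y -> prefix Y$X."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*'")
    x, y = query.split("*")
    return y + "$" + x

def lookup(index, query):
    """Prefix lookup of the rotated query; returns the matching original terms."""
    prefix = rotate_query(query)
    i = bisect_left(index, (prefix,))
    matches = set()
    while i < len(index) and index[i][0].startswith(prefix):
        matches.add(index[i][1])
        i += 1
    return sorted(matches)

index = build_permuterm(["hello", "help", "halo", "hero", "held"])  # invented vocabulary
print(rotate_query("hel*o"))   # the prefix to look up
print(lookup(index, "hel*o"))
print(lookup(index, "hel*"))
```

Storing one entry per rotation is what drives the lexicon growth noted on the query-processing slide: a term of n characters contributes n + 1 index entries.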
Processing wild-cards (Sec. 3.2.2)
- Query mon* can now be run as $m AND mo AND on
- Gets terms that match AND version of our wildcard query.
- But we’d enumerate moon.
- Must post-filter these terms against query.
- Surviving enumerated terms are then looked up in the term-document inverted index.
- Fast, space efficient (compared to permuterm).

Processing wild-card queries (Sec. 3.2.2)
- As before, we must execute a Boolean query for each enumerated, filtered term.
- Wild-cards can result in expensive query execution (very large disjunctions…)
  - pyth* AND prog*
- If you encourage “laziness” people will respond!
  [Search box: “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”]
- Which web search engines allow wildcard queries?

SPELLING CORRECTION

Spell correction
- Two principal uses
  - Correcting document(s) being indexed

