Stanford CS 276 - Lecture 3: Dictionaries and tolerant retrieval

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 3: Dictionaries and tolerant retrieval

Ch. 2: Recap of the previous lecture
The type/token distinction
  Terms are normalized types put in the dictionary
Tokenization problems:
  Hyphens, apostrophes, compounds, CJK
Term equivalence classing:
  Numbers, case folding, stemming, lemmatization
Skip pointers
  Encoding a tree-like structure in a postings list
Biword indexes for phrases
Positional indexes for phrases/proximity queries

Ch. 3: This lecture
Dictionary data structures
"Tolerant" retrieval
  Wild-card queries
  Spelling correction
  Soundex

Sec. 3.1: Dictionary data structures for inverted indexes
The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list ... in what data structure?

Sec. 3.1: A naïve dictionary
An array of struct:
  char[20]    int          Postings *
  20 bytes    4/8 bytes    4/8 bytes
How do we store a dictionary in memory efficiently?
How do we quickly look up elements at query time?

Sec. 3.1: Dictionary data structures
Two main choices:
  Hashtables
  Trees
Some IR systems use hashtables, some trees.

Sec. 3.1: Hashtables
Each vocabulary term is hashed to an integer
  (We assume you've seen hashtables before)
Pros:
  Lookup is faster than for a tree: O(1)
Cons:
  No easy way to find minor variants: judgment/judgement
  No prefix search [tolerant retrieval]
  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
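The hashtable trade-off can be made concrete with a small sketch. It uses Python's built-in dict as the hash table; the Entry type, the helper names lookup and prefix_scan, and the toy terms, counts, and document IDs are all assumptions made up for illustration, not part of the lecture.

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    doc_freq: int     # number of documents containing the term
    postings: list    # stand-in for a pointer to the postings list

# Toy vocabulary; terms, counts, and doc IDs are invented for illustration.
dictionary = {
    "judgment": Entry(2, [1, 4]),
    "judgement": Entry(1, [7]),
    "hypothesis": Entry(1, [3]),
    "hyperbole": Entry(1, [4]),
}

def lookup(term: str) -> Optional[Entry]:
    """Exact-match lookup: expected O(1) in a hash table."""
    return dictionary.get(term)

def prefix_scan(prefix: str) -> list:
    """Hashing discards term order, so a prefix query degrades to a full scan."""
    return sorted(t for t in dictionary if t.startswith(prefix))

print(lookup("judgment"))    # exact spelling is found directly
print(lookup("judgemnt"))    # a misspelled variant is simply absent
print(prefix_scan("hyp"))    # works, but only by touching every key
```

The exact lookup is cheap, but nothing in the table relates "judgment" to "judgement", and the prefix query costs time linear in the vocabulary size.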

Sec. 3.1: Tree: binary tree
[Figure: binary tree. The root splits into a-m and n-z; a-m splits into a-hu and hy-m; n-z splits into n-sh and si-z. Leaf terms shown: aardvark, huygens, sickle, zygot.]

Sec. 3.1: Tree: B-tree
[Figure: B-tree node with children a-hu, hy-m, n-z.]
Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].

Sec. 3.1: Trees
Simplest: binary tree
More usual: B-trees
Trees require a standard ordering of characters and hence strings ... but we typically have one
Pros:
  Solves the prefix problem (terms starting with hyp)
Cons:
  Slower: O(log M) [and this requires a balanced tree]
  Rebalancing binary trees is expensive
  But B-trees mitigate the rebalancing problem

WILD-CARD QUERIES

Sec. 3.2: Wild-card queries: *
mon*: find all docs containing any word beginning with "mon".
Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
*mon: find words ending in "mon": harder
  Maintain an additional B-tree for terms backwards.
  Can retrieve all words in range: nom ≤ w < non.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Sec. 3.2: Query processing
At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
We still have to look up the postings for each enumerated term.
E.g., consider the query: se*ate AND fil*er
This may result in the execution of many Boolean AND queries.

Sec. 3.2: B-trees handle *'s at the end of a query term
How can we handle *'s in the middle of a query term?
  co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets
  Expensive
The solution: transform wild-card queries so that the *'s occur at the end
This gives rise to the Permuterm Index.

Sec. 3.2.1: Permuterm index
For term hello, index under:
  hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
Queries:
  X      lookup on X$
  X*     lookup on $X*
  *X     lookup on X$*
  *X*    lookup on X*
  X*Y    lookup on Y$X*
  X*Y*Z  ??? Exercise!
Example: Query = hel*o, so X = hel, Y = o. Lookup o$hel*

Sec. 3.2.1: Permuterm query processing
Rotate query wild-card to the right
Now use B-tree lookup as before.
Permuterm problem: ≈ quadruples lexicon size (empirical observation for English)

Sec. 3.2.2: Bigram (k-gram) indexes
Enumerate all k-grams (sequence of k chars) occurring in any term
e.g., from text "April is the cruelest month" we get the 2-grams (bigrams):
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
$ is a special word boundary symbol
Maintain a second inverted index from bigrams to dictionary terms that match each bigram.

Sec. 3.2.2: Bigram index example
The k-gram index finds terms based on a query consisting of k-grams (here k = 2).
  $m -> mace, madden
  mo -> among, amortize
  on -> along, among

Sec. 3.2.2: Processing wild-cards
Query mon* can now be run as
  $m AND mo AND on
Gets terms that match the AND version of our wildcard query.
But we'd enumerate moon.
Must post-filter these terms against the query.
Surviving enumerated terms are then looked up in the term-document inverted index.
Fast, space efficient (compared to permuterm).

Sec. 3.2.2: Processing wild-card queries
As before, we must execute a Boolean query for each enumerated, filtered term.
Wild-cards can result in expensive query execution (very large disjunctions ...)
  pyth* AND prog*
If you encourage "laziness" people will respond!
[Figure: search box labeled "Type your search terms, use '*' if you need to. E.g., Alex* will match Alexander."]
Which web search engines allow wildcard queries?

SPELLING CORRECTION

Sec. 3.3: Spell correction
Two principal uses
Two main flavors:
  Isolated
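
Returning to the bigram index of Sec. 3.2.2, the mon* example can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function names and the toy vocabulary (the terms from the bigram index example plus moon and month) are assumptions, and the post-filter borrows Python's fnmatchcase as a convenient stand-in for matching a term against the wild-card pattern.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def term_kgrams(term, k=2):
    """All k-grams of a term padded with the boundary symbol $."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    """Second inverted index: k-gram -> set of dictionary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in term_kgrams(term, k):
            index[gram].add(term)
    return index

def query_kgrams(pattern, k=2):
    """k-grams implied by a wild-card pattern: pad with $, split on *."""
    grams = set()
    for piece in f"${pattern}$".split("*"):
        grams.update(piece[i:i + k] for i in range(len(piece) - k + 1))
    return grams

def candidates(pattern, index, k=2):
    """Terms matching the AND of the query's k-grams (may include false hits)."""
    grams = query_kgrams(pattern, k)
    if not grams:  # e.g. the pattern "*": every term is a candidate
        return set().union(*index.values())
    return set.intersection(*(index.get(g, set()) for g in grams))

def wildcard_terms(pattern, index, k=2):
    """Post-filter the candidates against the original wild-card pattern."""
    return sorted(t for t in candidates(pattern, index, k) if fnmatchcase(t, pattern))

# Hypothetical toy vocabulary for illustration.
vocabulary = ["mace", "madden", "among", "amortize", "along", "moon", "month"]
index = build_kgram_index(vocabulary)
print(candidates("mon*", index))      # $m AND mo AND on
print(wildcard_terms("mon*", index))  # after post-filtering
```

Here moon survives the AND of $m, mo, and on because it contains all three bigrams, just not contiguously as "mon"; the post-filter is what removes it before the surviving terms are looked up in the term-document index.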

