Unformatted text preview:

1 CSCI 5417 Information Retrieval Systems Jim Martin!Lecture 3 8/30/2010 CSCI5417-IRToday 8/30  Review  Conjunctive queries (intersect)  Dictionary contents  Phrasal queries  Tolerant query handling  Wildcards  Spelling correction2 CSCI5417-IRDoc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1Index: Dictionary and Postings CSCI5417-IRBoolean AND: Intersection (1)3 CSCI5417-IRBoolean AND: Intersection (2) CSCI5417-IRReview: Dictionary  What goes into creating the terms that make it into the dictionary?  Tokenization  Case folding  Stemming  Stop-listing  Normalization  Dealing with numbers (and number-like entities)  Complex morphology4 Dictionary  The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what kind of data structure? CSCI5417-IRA Naïve Dictionary  An array of structs? char[20] int postings * 20 bytes 4/8 bytes 4/8 bytes  How do we quickly look up elements at query time? CSCI5417-IR5 Dictionary Data Structures  Two main choices:  Hash tables  Trees  Some IR systems use hashes, some trees. Choice depends on the application details. CSCI5417-IRHashes  Each vocabulary term is hashed to an integer  I assume you’ve seen hashtables before  Pros:  Lookup is faster than for a tree: O(1)  Cons:  No easy way to find minor variants:  judgment/judgement  No prefix search [tolerant retrieval]  If vocabulary keeps growing, need to occasionally rehash everything CSCI5417-IR6 Root a-m n-z a-hu hy-m n-sh si-z aardvark!huygens!sickle!zygot!Binary Tree Approach Sec. 3.1 CSCI5417-IRTree: B-tree  Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. a-hu hy-m n-z CSCI5417-IR7 Trees  Simplest approach: binary trees  More typical : B-trees  Trees require a standard ordering of characters and hence strings … but we have that  Pros:  Facilitates prefix processing (terms starting with hyp)  Google’s “search as you type”  Cons:  Slower: O(log M) [and this requires balanced tree]  Rebalancing binary trees is expensive  But B-trees mitigate the rebalancing problem CSCI5417-IRBack to Query Processing  Users are so demanding...  In addition to phrasal queries, they like to  Use wild-card queries  Misspell stuff  So we better see what we can do about those things CSCI5417-IR8 CSCI5417-IRPhrasal queries  Want to handle queries such as  “Colorado Buffaloes” – as a phrase  This concept is popular with users; about 10% of ad hoc web queries are phrasal queries  Postings that consist of document lists alone are not sufficient to handle phrasal queries  Two general approaches  Word N-gram indexing  Positional indexing CSCI5417-IRPositional Indexing  Change the content of the postings  Store, for each term, entries of the form: <number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>9 CSCI5417-IRPositional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? CSCI5417-IRProcessing a phrase query  Extract postings for each distinct term: to, be, or, not.  Merge their doc:position lists to enumerate all positions with “to be or not to be”.  to:  2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...  be:  1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...  Same general method for proximity searches (“near” operator).10 CSCI5417-IRRules of thumb  Positional index size 35–50% of volume of original text  Caveat: all of this holds for “English-like” languages CSCI5417-IRWild Card Queries  Two flavors  Word-based  Caribb*  Phrasal  “Pirates * Caribbean”  General approach  Spawn a new set of queries from the original query  Basically a dictionary operation  Run each of those queries in a not totally stupid way11 CSCI5417-IRSimple Single Wild-card Queries: *  Single instance of a *  * means an string of length 0 or more  This is not Kleene *.  mon*: find all docs containing any word beginning “mon”.  Using trees to implement the dictionary gives you prefixes  *mon: find words ending in “mon”  Maintain a backwards index Query processing  At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.  We still have to look up the postings for each enumerated term.  For example, consider the query mon* AND octob* This results in the execution of many Boolean AND queries. CSCI5417-IR12 CSCI5417-IRArbitrary Wildcards  How can we handle *’s in the middle of query term?  The solution: transform every possible wild-card query so that the *’s occur at the end  This motivates the Permuterm Index  The dictionary/tree scheme remains the same; but we populate the dictionary with extra (special) terms CSCI5417-IRPermuterm Index  For the real term hello create entries under:  hello$, ello$h, llo$he, lo$hel, o$hell where $ is a special symbol.  Example Query = hel*o Add the $ = hel*o$ Rotate * to the back Lookup o$hel*13 Permuterm index  For term hello, index under:  hello$, ello$h, llo$he, lo$hel, o$hell


View Full Document

CU-Boulder CSCI 5417 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?