CU-Boulder CSCI 5417 - Lecture Notes - D2528478

Home> Schools> University of Colorado at Boulder> Computer Science (CSCI) > CSCI 5417> Lecture Notes

CU-Boulder CSCI 5417 - Lecture Notes

School name University of Colorado at Boulder

Course Csci 5417- Information Retrieval Systems

Pages 22

Download Save

Unformatted text preview:

1 CSCI 5417 Information Retrieval Systems Jim Martin!Lecture 3 8/30/2010 CSCI5417-IRToday 8/30  Review  Conjunctive queries (intersect)  Dictionary contents  Phrasal queries  Tolerant query handling  Wildcards  Spelling correction2 CSCI5417-IRDoc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1Index: Dictionary and Postings CSCI5417-IRBoolean AND: Intersection (1)3 CSCI5417-IRBoolean AND: Intersection (2) CSCI5417-IRReview: Dictionary  What goes into creating the terms that make it into the dictionary?  Tokenization  Case folding  Stemming  Stop-listing  Normalization  Dealing with numbers (and number-like entities)  Complex morphology4 Dictionary  The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what kind of data structure? CSCI5417-IRA Naïve Dictionary  An array of structs? char[20] int postings * 20 bytes 4/8 bytes 4/8 bytes  How do we quickly look up elements at query time? CSCI5417-IR5 Dictionary Data Structures  Two main choices:  Hash tables  Trees  Some IR systems use hashes, some trees. Choice depends on the application details. CSCI5417-IRHashes  Each vocabulary term is hashed to an integer  I assume you’ve seen hashtables before  Pros:  Lookup is faster than for a tree: O(1)  Cons:  No easy way to find minor variants:  judgment/judgement  No prefix search [tolerant retrieval]  If vocabulary keeps growing, need to occasionally rehash everything CSCI5417-IR6 Root a-m n-z a-hu hy-m n-sh si-z aardvark!huygens!sickle!zygot!Binary Tree Approach Sec. 3.1 CSCI5417-IRTree: B-tree  Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. a-hu hy-m n-z CSCI5417-IR7 Trees  Simplest approach: binary trees  More typical : B-trees  Trees require a standard ordering of characters and hence strings … but we have that  Pros:  Facilitates prefix processing (terms starting with hyp)  Google’s “search as you type”  Cons:  Slower: O(log M) [and this requires balanced tree]  Rebalancing binary trees is expensive  But B-trees mitigate the rebalancing problem CSCI5417-IRBack to Query Processing  Users are so demanding...  In addition to phrasal queries, they like to  Use wild-card queries  Misspell stuff  So we better see what we can do about those things CSCI5417-IR8 CSCI5417-IRPhrasal queries  Want to handle queries such as  “Colorado Buffaloes” – as a phrase  This concept is popular with users; about 10% of ad hoc web queries are phrasal queries  Postings that consist of document lists alone are not sufficient to handle phrasal queries  Two general approaches  Word N-gram indexing  Positional indexing CSCI5417-IRPositional Indexing  Change the content of the postings  Store, for each term, entries of the form: <number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>9 CSCI5417-IRPositional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? CSCI5417-IRProcessing a phrase query  Extract postings for each distinct term: to, be, or, not.  Merge their doc:position lists to enumerate all positions with “to be or not to be”.  to:  2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...  be:  1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...  Same general method for proximity searches (“near” operator).10 CSCI5417-IRRules of thumb  Positional index size 35–50% of volume of original text  Caveat: all of this holds for “English-like” languages CSCI5417-IRWild Card Queries  Two flavors  Word-based  Caribb*  Phrasal  “Pirates * Caribbean”  General approach  Spawn a new set of queries from the original query  Basically a dictionary operation  Run each of those queries in a not totally stupid way11 CSCI5417-IRSimple Single Wild-card Queries: *  Single instance of a *  * means an string of length 0 or more  This is not Kleene *.  mon*: find all docs containing any word beginning “mon”.  Using trees to implement the dictionary gives you prefixes  *mon: find words ending in “mon”  Maintain a backwards index Query processing  At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.  We still have to look up the postings for each enumerated term.  For example, consider the query mon* AND octob* This results in the execution of many Boolean AND queries. CSCI5417-IR12 CSCI5417-IRArbitrary Wildcards  How can we handle *’s in the middle of query term?  The solution: transform every possible wild-card query so that the *’s occur at the end  This motivates the Permuterm Index  The dictionary/tree scheme remains the same; but we populate the dictionary with extra (special) terms CSCI5417-IRPermuterm Index  For the real term hello create entries under:  hello$, ello$h, llo$he, lo$hel, o$hell where $ is a special symbol.  Example Query = hel*o Add the $ = hel*o$ Rotate * to the back Lookup o$hel*13 Permuterm index  For term hello, index under:  hello$, ello$h, llo$he, lo$hel, o$hell

View Full Document


School:
Email:
New Password:
Confirm Password:

CU-Boulder CSCI 5417 - Lecture Notes

Sign up for free to view:

Please select your school