1 CSCI 5417 Information Retrieval Systems Jim Martin!Lecture 3 8/30/2010 CSCI5417-IRToday 8/30 Review Conjunctive queries (intersect) Dictionary contents Phrasal queries Tolerant query handling Wildcards Spelling correction2 CSCI5417-IRDoc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1Index: Dictionary and Postings CSCI5417-IRBoolean AND: Intersection (1)3 CSCI5417-IRBoolean AND: Intersection (2) CSCI5417-IRReview: Dictionary What goes into creating the terms that make it into the dictionary? Tokenization Case folding Stemming Stop-listing Normalization Dealing with numbers (and number-like entities) Complex morphology4 Dictionary The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what kind of data structure? CSCI5417-IRA Naïve Dictionary An array of structs? char[20] int postings * 20 bytes 4/8 bytes 4/8 bytes How do we quickly look up elements at query time? CSCI5417-IR5 Dictionary Data Structures Two main choices: Hash tables Trees Some IR systems use hashes, some trees. Choice depends on the application details. CSCI5417-IRHashes Each vocabulary term is hashed to an integer I assume you’ve seen hashtables before Pros: Lookup is faster than for a tree: O(1) Cons: No easy way to find minor variants: judgment/judgement No prefix search [tolerant retrieval] If vocabulary keeps growing, need to occasionally rehash everything CSCI5417-IR6 Root a-m n-z a-hu hy-m n-sh si-z aardvark!huygens!sickle!zygot!Binary Tree Approach Sec. 3.1 CSCI5417-IRTree: B-tree Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. a-hu hy-m n-z CSCI5417-IR7 Trees Simplest approach: binary trees More typical : B-trees Trees require a standard ordering of characters and hence strings … but we have that Pros: Facilitates prefix processing (terms starting with hyp) Google’s “search as you type” Cons: Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive But B-trees mitigate the rebalancing problem CSCI5417-IRBack to Query Processing Users are so demanding... In addition to phrasal queries, they like to Use wild-card queries Misspell stuff So we better see what we can do about those things CSCI5417-IR8 CSCI5417-IRPhrasal queries Want to handle queries such as “Colorado Buffaloes” – as a phrase This concept is popular with users; about 10% of ad hoc web queries are phrasal queries Postings that consist of document lists alone are not sufficient to handle phrasal queries Two general approaches Word N-gram indexing Positional indexing CSCI5417-IRPositional Indexing Change the content of the postings Store, for each term, entries of the form: <number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>9 CSCI5417-IRPositional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? CSCI5417-IRProcessing a phrase query Extract postings for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with “to be or not to be”. to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ... be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ... Same general method for proximity searches (“near” operator).10 CSCI5417-IRRules of thumb Positional index size 35–50% of volume of original text Caveat: all of this holds for “English-like” languages CSCI5417-IRWild Card Queries Two flavors Word-based Caribb* Phrasal “Pirates * Caribbean” General approach Spawn a new set of queries from the original query Basically a dictionary operation Run each of those queries in a not totally stupid way11 CSCI5417-IRSimple Single Wild-card Queries: * Single instance of a * * means an string of length 0 or more This is not Kleene *. mon*: find all docs containing any word beginning “mon”. Using trees to implement the dictionary gives you prefixes *mon: find words ending in “mon” Maintain a backwards index Query processing At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. We still have to look up the postings for each enumerated term. For example, consider the query mon* AND octob* This results in the execution of many Boolean AND queries. CSCI5417-IR12 CSCI5417-IRArbitrary Wildcards How can we handle *’s in the middle of query term? The solution: transform every possible wild-card query so that the *’s occur at the end This motivates the Permuterm Index The dictionary/tree scheme remains the same; but we populate the dictionary with extra (special) terms CSCI5417-IRPermuterm Index For the real term hello create entries under: hello$, ello$h, llo$he, lo$hel, o$hell where $ is a special symbol. Example Query = hel*o Add the $ = hel*o$ Rotate * to the back Lookup o$hel*13 Permuterm index For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell
View Full Document