DOC PREVIEW
CU-Boulder CSCI 5417 - Lecture 3

This preview shows page 1-2-3-20-21-22-41-42-43 out of 43 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CSCI 5417 Information Retrieval Systems Jim MartinToday 8/30Index: Dictionary and PostingsBoolean AND: Intersection (1)Boolean AND: Intersection (2)Review: DictionaryDictionaryA Naïve DictionaryDictionary Data StructuresHashesBinary Tree ApproachTree: B-treeTreesBack to Query ProcessingPhrasal queriesPositional IndexingPositional index exampleProcessing a phrase queryRules of thumbWild Card QueriesSimple Single Wild-card Queries: *Query processingArbitrary WildcardsPermuterm IndexPermuterm indexPermuterm query processingNotice...Programming Assignment 1Spelling CorrectionDocument correctionGoogle BooksSlide 32Query correctionIsolated word correctionSlide 35Edit distanceWeighted edit distanceUsing edit distancesEdit distance to all dictionary terms?Context-sensitive spell correctionContext-sensitive correctionGeneral issue in spell correctionNext TimeCSCI 5417Information Retrieval SystemsJim MartinLecture 38/30/2010CSCI 5417 - IRToday 8/30ReviewConjunctive queries (intersect)Dictionary contentsPhrasal queriesTolerant query handlingWildcardsSpelling correctionCSCI 5417 - IRDoc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1Index: Dictionary and PostingsCSCI 5417 - IRBoolean AND: Intersection (1)CSCI 5417 - IRBoolean AND: Intersection (2)CSCI 5417 - IRReview: DictionaryWhat goes into creating the terms that make it into the dictionary?TokenizationCase foldingStemmingStop-listingNormalizationDealing with numbers (and number-like entities)Complex morphologyDictionaryThe dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what kind of data structure?CSCI 5417 - IRA Naïve DictionaryAn array of structs? char[20] int postings * 20 bytes 4/8 bytes 4/8 bytes How do we quickly look up elements at query time?CSCI 5417 - IRDictionary Data StructuresTwo main choices:Hash tablesTreesSome IR systems use hashes, some trees. Choice depends on the application details.CSCI 5417 - IRHashesEach vocabulary term is hashed to an integerI assume you’ve seen hashtables beforePros:Lookup is faster than for a tree: O(1)Cons:No easy way to find minor variants:judgment/judgementNo prefix search [tolerant retrieval]If vocabulary keeps growing, need to occasionally rehash everythingCSCI 5417 - IRRoota-mn-za-hu hy-m n-sh si-zaardvarkhuygenssicklezygotBinary Tree ApproachSec. 3.1CSCI 5417 - IRTree: B-treeDefinition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].a-huhy-mn-zCSCI 5417 - IRTreesSimplest approach: binary treesMore typical : B-treesTrees require a standard ordering of characters and hence strings … but we have thatPros:Facilitates prefix processing (terms starting with hyp)Google’s “search as you type”Cons:Slower: O(log M) [and this requires balanced tree]Rebalancing binary trees is expensiveBut B-trees mitigate the rebalancing problemCSCI 5417 - IRBack to Query ProcessingUsers are so demanding...In addition to phrasal queries, they like to Use wild-card queriesMisspell stufSo we better see what we can do about those thingsCSCI 5417 - IRCSCI 5417 - IRPhrasal queriesWant to handle queries such as “Colorado Buffaloes” – as a phraseThis concept is popular with users; about 10% of ad hoc web queries are phrasal queriesPostings that consist of document lists alone are not sufficient to handle phrasal queriesTwo general approachesWord N-gram indexingPositional indexingCSCI 5417 - IRPositional IndexingChange the content of the postingsStore, for each term, entries of the form:<number of docs containing term;doc1: position1, position2 … ;doc2: position1, position2 … ;etc.>CSCI 5417 - IRPositional index example<be: 993427;1: 7, 18, 33, 72, 86, 231;2: 3, 149;4: 17, 191, 291, 430, 434;5: 363, 367, …>Which of docs 1,2,4,5could contain “to beor not to be”?CSCI 5417 - IRProcessing a phrase queryExtract postings for each distinct term: to, be, or, not.Merge their doc:position lists to enumerate all positions with “to be or not to be”.to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...Same general method for proximity searches (“near” operator).CSCI 5417 - IRRules of thumbPositional index size 35–50% of volume of original textCaveat: all of this holds for “English-like” languagesCSCI 5417 - IRWild Card QueriesTwo flavorsWord-basedCaribb*Phrasal“Pirates * Caribbean”General approachSpawn a new set of queries from the original queryBasically a dictionary operationRun each of those queries in a not totally stupid wayCSCI 5417 - IRSimple Single Wild-card Queries: *Single instance of a ** means an string of length 0 or moreThis is not Kleene *.mon*: find all docs containing any word beginning “mon”.Using trees to implement the dictionary gives you prefixes*mon: find words ending in “mon”Maintain a backwards indexQuery processingAt this point, we have an enumeration of all terms in the dictionary that match the wild-card query.We still have to look up the postings for each enumerated term.For example, consider the querymon* AND octob*This results in the execution of many Boolean AND queries.CSCI 5417 - IRCSCI 5417 - IRArbitrary WildcardsHow can we handle *’s in the middle of query term?The solution: transform every possible wild-card query so that the *’s occur at the endThis motivates the Permuterm IndexThe dictionary/tree scheme remains the same; but we populate the dictionary with extra (special) termsCSCI 5417 - IRPermuterm IndexFor the real term hello create entries under:hello$, ello$h, llo$he, lo$hel, o$hellwhere $ is a special symbol.ExampleQuery = hel*oAdd the $= hel*o$Rotate * to the backLookup o$hel*Permuterm


View Full Document

CU-Boulder CSCI 5417 - Lecture 3

Download Lecture 3
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture 3 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture 3 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?