Wright CS 707 - Inverted Index Construction

Unformatted text preview:

Inverted Index ConstructionUnstructured data in 1650Term-document incidenceIncidence vectorsAnswers to queryBasic assumptions of Information RetrievalThe classic search modelHow good are the retrieved docs?Bigger corporaCan’t build the matrixInverted indexSlide 12Inverted index constructionIndexer stepsSlide 16Slide 17Slide 18Query ProcessingQuery processing: ANDThe mergeBoolean queries: Exact matchExample: WestLaw http://www.westlaw.com/Slide 24Example: Using Search Connectors and Commands http://help.lexisnexis.com/Query optimizationQuery optimization exampleMore general optimizationSpace RequirementsSlide 30What’s ahead in IR? Beyond term searchOther Indexing TechniquesPrasad L3InvertedIndex 1Inverted Index ConstructionAdapted from Lectures byPrabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)Prasad L3InvertedIndex 2Unstructured data in 1650Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia ?One could grepgrep all of Shakespeare’s plays for Brutus and Caesar, then strip out plays containing Calpurnia ?Slow (for large corpora)Other operations (e.g., find the word Romans near countrymen) not feasiblePrasad L3InvertedIndex 3Term-document incidence1 if play contains word, 0 otherwiseAntony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 0Brutus AND Caesar but NOT CalpurniaPrasad L3InvertedIndex 4Incidence vectorsSo we have a 0/1 vector for each term.To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND.110100 AND 110111 AND 101111 = 100100.Prasad L3InvertedIndex 5Answers to queryAntony and Cleopatra, Act III, Scene iiAgrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.Hamlet, Act III, Scene iiLord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.Prasad 6Basic assumptions of Information RetrievalCollection: Fixed set of documentsGoal: Retrieve documents with information that is relevant to user’s information need and helps him complete a taskL3InvertedIndexThe classic search modelCorpusTASK Info NeedQuery Verbal formResultsSEARCHENGINEQueryRefinement Get rid of mice in a politically correct wayInfo about removing micewithout killing them How do I trap mice alive?mouse trapMis-conceptionMis-translationMis-formulation7Prasad 7L3InvertedIndexHow good are the retrieved docs?Precision : Fraction of retrieved docs that are relevant to user’s information needRecall : Fraction of relevant docs in collection that are retrievedMore precise definitions and measurements to follow in later lectures8Prasad L3InvertedIndexPrasad L3InvertedIndex 9Bigger corporaConsider N = 1M documents, each with about 1K terms.Avg 6 bytes/term including spaces/punctuation 6GB of data in the documents.Say there are m = 500K distinct terms among these.Prasad L3InvertedIndex 10Can’t build the matrix500K x 1M matrix has half-a-trillion 0’s and 1’s.But it has no more than one billion 1’s.matrix is extremely sparse.What’s a better representation?We only record the 1 positions.Why?Prasad L3InvertedIndex 11Inverted indexFor each term T, we must store a list of all documents that contain T.Do we use an array or a list for this?BrutusCalpurniaCaesar1 2 3 5 8 13 21 342 4 8 16 32 6412813 16What happens if the word Caesar is added to document 14?Prasad L3InvertedIndex 12Inverted indexLinked lists generally preferred to arrays+Dynamic space allocation+Insertion of terms into documents easy−Space overhead of pointersBrutusCalpurniaCaesar2 4 8 16 32 64 1282 3 5 8 13 21 3413 161DictionaryPostings listsSorted by docID (more later on why).PostingInverted index constructionTokenizerToken stream.FriendsRomans CountrymenLinguistic modulesModified tokens.friendroman countrymanIndexerInverted index.friendromancountryman2 4213161More onthese later.Documents tobe indexed.Friends, Romans, countrymen.Prasad 13L3InvertedIndexPrasad L3InvertedIndexSequence of (Modified token, Document ID) pairs.I did enact JuliusCaesar I was killed i' the Capitol; Brutus killed me.Doc 1So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitiousDoc 2Indexer steps14Prasad L3InvertedIndexSort by terms. Core indexing step.15Multiple term entries in a single document are merged.Frequency information is added. Why frequency?Will discuss later.Prasad 16L3InvertedIndexL3InvertedIndexThe result is split into a Dictionary file and a Postings file. Prasad 17Prasad 18Where do we pay in storage? PointersTermsWill quantify the storage, later.L3InvertedIndexPrasad L3InvertedIndex 19Query ProcessingWhat?How?Prasad L3InvertedIndex 20Query processing: ANDConsider processing the query:Brutus AND CaesarLocate Brutus in the Dictionary;Retrieve its postings.Locate Caesar in the Dictionary;Retrieve its postings.“Merge” the two postings:128342 4 8 16 32 641 2 3 5 8 1321BrutusCaesarPrasad L3InvertedIndex 21341282 4 8 1632 641 235 8 13 21The mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries128342 4 8 16 32 641 2 3 5 8 13 21BrutusCaesar28If the list lengths are x and y, the merge takes O(x+y)operations.Crucial: postings sorted by docID.Prasad L3InvertedIndex 22Boolean queries: Exact matchBoolean Queries are queries using AND, OR and NOT to join query termsViews each document as a set of wordsIs precise: document matches or not.Primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers) still like Boolean queries:You know exactly what you’re getting.Prasad L3InvertedIndex 23Example: WestLaw http://www.westlaw.com/Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)Tens of terabytes of data; 700,000 usersMajority of users still use boolean queriesExample query:What is the statute of limitations in cases involving the federal tort claims act?LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM/3 = within 3 words, /S = in same sentencePrasad L3InvertedIndex 24Example: WestLaw http://www.westlaw.com/Another example


View Full Document

Wright CS 707 - Inverted Index Construction

Download Inverted Index Construction
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Inverted Index Construction and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Inverted Index Construction 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?