CS276 Infomation Retrieval and Web MiningQueryTerm-document incidenceIncidence vectorsAnswers to queryBigger corporaCan’t build the matrixInverted indexSlide 9Inverted index constructionIndexer stepsSlide 13Slide 14Slide 15The index we just builtQuery processingThe mergeBoolean queries: Exact matchExample: WestLaw http://www.westlaw.com/More general mergesMergingQuery optimizationQuery optimization exampleMore general optimizationExerciseQuery processing exercisesBeyond term searchEvidence accumulationRanking search resultsStructured vs unstructured dataUnstructured dataSemi-structured dataMore sophisticated semi-structured searchClustering and classificationThe web and its challengesSlide 37Course administriviaCourse staffResources for today’s lectureCS276Infomation Retrieval and Web MiningLecture 1QueryWhich plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?Slow (for large corpora)NOT Calpurnia is non-trivialOther operations (e.g., find the word Romans near countrymen) not feasibleTerm-document incidence1 if play contains word, 0 otherwiseAntony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 0Brutus AND Caesar but NOT CalpurniaIncidence vectorsSo we have a 0/1 vector for each term.To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND.110100 AND 110111 AND 101111 = 100100.Answers to queryAntony and Cleopatra, Act III, Scene iiAgrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.Hamlet, Act III, Scene iiLord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.Bigger corporaConsider n = 1M documents, each with about 1K terms.Avg 6 bytes/term incl spaces/punctuation 6GB of data in the documents.Say there are m = 500K distinct terms among these.Can’t build the matrix500K x 1M matrix has half-a-trillion 0’s and 1’s.But it has no more than one billion 1’s.matrix is extremely sparse.What’s a better representation?We only record the 1 positions.Why?Inverted indexFor each term T, we must store a list of all documents that contain T.Do we use an array or a list for this?BrutusCalpurniaCaesar1 2 3 5 8 13 21 342 4 8 16 32 6412813 16What happens if the word Caesar is added to document 14?Inverted indexLinked lists generally preferred to arraysDynamic space allocationInsertion of terms into documents easySpace overhead of pointersBrutusCalpurniaCaesar2 4 8 16 32 64 1282 3 5 8 13 21 3413 161DictionaryPostingsSorted by docID (more later on why).Inverted index constructionTokenizerToken stream.FriendsRomans CountrymenLinguistic modulesModified tokens.friendroman countrymanIndexerInverted index.friendromancountryman2 4213161More onthese later.Documents tobe indexed.Friends, Romans, countrymen.Sequence of (Modified token, Document ID) pairs.I did enact JuliusCaesar I was killed i' the Capitol; Brutus killed me.Doc 1So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitiousDoc 2Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2Indexer stepsSort by terms. Term Doc #ambitious2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2 Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2Core indexing step.Multiple term entries in a single document are merged.Frequency information is added.Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1 Term Doc #ambitious2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2Why frequency?Will discuss later.The result is split into a Dictionary file and a Postings file.Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Tot Freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1 Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1Where do we pay in storage? Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Tot Freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1PointersTermsWill quantify the storage, later.The index we just builtHow do we process a query?Later - what kinds of queries can we process?Today’s focusQuery processingConsider processing the query:Brutus AND CaesarLocate Brutus in the Dictionary;Retrieve its postings.Locate Caesar in the Dictionary;Retrieve its postings.“Merge” the two postings:128342 4 8 16 32 641 2 3 5 8 1321BrutusCaesar341282 4 8 1632 641 235 8 13 21The mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries128342 4 8 16 32 641 2 3 5 8 13 21BrutusCaesar28If the list lengths are x and y, the merge takes O(x+y)operations.Crucial: postings sorted by docID.Boolean queries: Exact matchBoolean Queries are queries using AND, OR and NOT together with query termsViews each document as a set of wordsIs precise: document matches condition or not.Primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers)
View Full Document