Villanova CSC 9010 - Infomation Retrieval and Web Mining - D2643981

Home> Schools> Villanova University> Computer Science (CSC) > CSC 9010> Infomation Retrieval and Web Mining

Villanova CSC 9010 - Infomation Retrieval and Web Mining

Course Csc 9010- TOP:High Performance Computing

Pages 40

Download Save

Unformatted text preview:

CS276 Infomation Retrieval and Web MiningQueryTerm-document incidenceIncidence vectorsAnswers to queryBigger corporaCan’t build the matrixInverted indexSlide 9Inverted index constructionIndexer stepsSlide 13Slide 14Slide 15The index we just builtQuery processingThe mergeBoolean queries: Exact matchExample: WestLaw http://www.westlaw.com/More general mergesMergingQuery optimizationQuery optimization exampleMore general optimizationExerciseQuery processing exercisesBeyond term searchEvidence accumulationRanking search resultsStructured vs unstructured dataUnstructured dataSemi-structured dataMore sophisticated semi-structured searchClustering and classificationThe web and its challengesSlide 37Course administriviaCourse staffResources for today’s lectureCS276Infomation Retrieval and Web MiningLecture 1QueryWhich plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?Slow (for large corpora)NOT Calpurnia is non-trivialOther operations (e.g., find the word Romans near countrymen) not feasibleTerm-document incidence1 if play contains word, 0 otherwiseAntony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 0Brutus AND Caesar but NOT CalpurniaIncidence vectorsSo we have a 0/1 vector for each term.To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND.110100 AND 110111 AND 101111 = 100100.Answers to queryAntony and Cleopatra, Act III, Scene iiAgrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.Hamlet, Act III, Scene iiLord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.Bigger corporaConsider n = 1M documents, each with about 1K terms.Avg 6 bytes/term incl spaces/punctuation 6GB of data in the documents.Say there are m = 500K distinct terms among these.Can’t build the matrix500K x 1M matrix has half-a-trillion 0’s and 1’s.But it has no more than one billion 1’s.matrix is extremely sparse.What’s a better representation?We only record the 1 positions.Why?Inverted indexFor each term T, we must store a list of all documents that contain T.Do we use an array or a list for this?BrutusCalpurniaCaesar1 2 3 5 8 13 21 342 4 8 16 32 6412813 16What happens if the word Caesar is added to document 14?Inverted indexLinked lists generally preferred to arraysDynamic space allocationInsertion of terms into documents easySpace overhead of pointersBrutusCalpurniaCaesar2 4 8 16 32 64 1282 3 5 8 13 21 3413 161DictionaryPostingsSorted by docID (more later on why).Inverted index constructionTokenizerToken stream.FriendsRomans CountrymenLinguistic modulesModified tokens.friendroman countrymanIndexerInverted index.friendromancountryman2 4213161More onthese later.Documents tobe indexed.Friends, Romans, countrymen.Sequence of (Modified token, Document ID) pairs.I did enact JuliusCaesar I was killed i' the Capitol; Brutus killed me.Doc 1So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitiousDoc 2Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2Indexer stepsSort by terms. Term Doc #ambitious2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2 Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2Core indexing step.Multiple term entries in a single document are merged.Frequency information is added.Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1 Term Doc #ambitious2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2Why frequency?Will discuss later.The result is split into a Dictionary file and a Postings file.Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Tot Freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1 Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1Where do we pay in storage? Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1 Term N docs Tot Freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1PointersTermsWill quantify the storage, later.The index we just builtHow do we process a query?Later - what kinds of queries can we process?Today’s focusQuery processingConsider processing the query:Brutus AND CaesarLocate Brutus in the Dictionary;Retrieve its postings.Locate Caesar in the Dictionary;Retrieve its postings.“Merge” the two postings:128342 4 8 16 32 641 2 3 5 8 1321BrutusCaesar341282 4 8 1632 641 235 8 13 21The mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries128342 4 8 16 32 641 2 3 5 8 13 21BrutusCaesar28If the list lengths are x and y, the merge takes O(x+y)operations.Crucial: postings sorted by docID.Boolean queries: Exact matchBoolean Queries are queries using AND, OR and NOT together with query termsViews each document as a set of wordsIs precise: document matches condition or not.Primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers)

View Full Document


School:
Email:
New Password:
Confirm Password:

Villanova CSC 9010 - Infomation Retrieval and Web Mining

Sign up for free to view:

Please select your school