CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 2, 8/25/2011

01/15/19 CSCI 5417 - IR

Slides in this lecture:
- Today 8/25
- Simple Unstructured Data Scenario
- Grepping is Not an Option
- Term-Document Matrix
- Incidence Vectors
- Answers to Query
- Bigger Collections
- The Matrix
- Inverted index
- Index Creation
- Indexer steps
- Indexing
- Given an Index
- Example: WestLaw http://www.westlaw.com/
- Boolean queries: Exact match
- Query processing: AND
- The Merge (Intersection)
- Intersecting two postings lists (a “merge” algorithm)
- Query optimization
- Query optimization example
- More general optimization
- Break
- Assignment 1: Due 9/1
- Terms Revisited
- Tokenization
- Numbers
- Tokenization: Language issues
- Normalization
- Normalization: other languages
- Case folding
- Stop words
- Lemmatization
- Next time


Today 8/25
- Basic indexing, retrieval scenario
- Boolean query processing
- More on terms and tokens


Simple Unstructured Data Scenario
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- We could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. This is problematic:
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Lines vs. plays


Grepping is Not an Option
- So if we can’t search the documents in response to a query, what can we do?
- Create a data structure up front that will facilitate the kind of searching we want to do.


Term-Document Matrix
1 if play contains word, 0 otherwise

            Antony and  Julius  The
            Cleopatra   Caesar  Tempest  Hamlet  Othello  Macbeth
Antony      1           1       0        0       0        1
Brutus      1           1       0        1       0        0
Caesar      1           1       0        1       1        1
Calpurnia   0           1       0        0       0        0
Cleopatra   1           0       0        0       0        0
mercy       1           0       1        1       1        1
worser      1           0       1        1       1        0

Query: Brutus AND Caesar but NOT Calpurnia


Incidence Vectors
- So we have a 0/1 vector for each term.
- Length of the term vector = number of plays
- To answer our query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and then do a bitwise AND.
- 110100 AND 110111 AND 101111 = 100100
- That is, plays 1 and 4: “Antony and Cleopatra” and “Hamlet”


Answers to Query
Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.


Bigger Collections
- Consider N = 1M documents, each with about 1K terms.
- Avg 6 bytes/term including spaces and punctuation, so 6GB of data just for the documents.
- Assume there are m = 500K distinct terms (types) among these.


The Matrix
- 500K x 1M matrix has 1/2 trillion entries
- But it has no more than one billion 1’s
  - Matrix is extremely sparse.
  - What’s the minimum number of 1’s in such an index?
- What’s a better representation?
  - Forget the 0’s. Only record the 1’s.
- Why?


Inverted index
- For each term T, we must store a list of all documents that contain T.

  Dictionary      Postings lists
  Brutus      ->  2 4 8 16 32 64 128
  Calpurnia   ->  1 2 3 5 8 13 21 34
  Caesar      ->  13 16

- Each entry in a postings list is a posting.
- Sorted by docID (more later on why).
- What happens if the word Caesar is later added to document 14?
- Linked lists generally preferred to arrays
  - Dynamic space allocation
  - Insertion of terms into documents easy
  - But there is the space overhead of pointers


Index Creation
  Documents to be indexed:  Friends, Romans, countrymen.
    -> Tokenizer
  Token stream:             Friends  Romans  Countrymen
    -> Linguistic modules
  Modified tokens:          friend  roman  countryman
    -> Indexer
  Inverted index:           friend      ->  2 4
                            roman       ->  1 2
                            countryman  ->  13 16


Indexer steps
- From the documents generate a stream of (Modified token, Document ID) pairs.

  Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  (Term, Doc #) pairs in document order:
  (I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1)
  (killed, 1) (i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
  (so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2)
  (brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)

- Sort pairs by terms. This is the core indexing step.

  (ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (caesar, 1) (caesar, 2)
  (caesar, 2) (capitol, 1) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1)
  (i', 1) (it, 2) (julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1)
  (noble, 2) (so, 2) (the, 1) (the, 2) (told, 2) (was, 1) (was, 2)
  (with, 2) (you, 2)

- Multiple term entries in a single document are merged. Frequency information is added.

  Term       Doc #  Term freq
  ambitious  2      1
  be         2      1
  brutus     1      1
  brutus     2      1
  caesar     1      1
  caesar     2      2
  capitol    1      1
  did        1      1
  enact      1      1
  hath       2      1
  I          1      2
  i'         1      1
  it         2      1
  julius     1      1
  killed     1      2
  let        2      1
  me         1      1
  noble      2      1
  so         2      1
  the        1      1
  the        2      1
  told       2      1
  was        1      1
  was        2      1
  with       2      1
  you        2      1

- The result is then split into a Dictionary file and a Postings file.

  Dictionary file                 Postings file
  Term       N docs  Coll freq    (Doc #, Freq)
  ambitious  1       1            (2, 1)
  be         1       1            (2, 1)
  brutus     2       2            (1, 1) (2, 1)
  caesar     2       3            (1, 1) (2, 2)
  capitol    1       1            (1, 1)
  did        1       1            (1, 1)
  enact      1       1            (1, 1)
  hath       1       1            (2, 1)
  I          1       2            (1, 2)
  i'         1       1            (1, 1)
  it         1       1            (2, 1)
  julius     1       1            (1, 1)
  killed     1       2            (1, 2)
  let        1       1            (2, 1)
  me         1       1            (1, 1)
  noble      1       1            (2, 1)
  so         1       1            (2, 1)
  the        2       2            (1, 1) (2, 1)
  told       1       1            (2, 1)
  was        2       2            (1, 1) (2, 1)
  with       1       1            (2, 1)
  you        1       1            (2, 1)

- Where’s the primary storage cost?
- Why split into two files?


Indexing
- Of course you wouldn’t really do it that way for large collections. Why?
- The indexer would be too slow


Given an Index
- So what is such an index good for?
  - Processing queries to get documents
- What’s a query?
  - An encoding of a user’s information need
  - For now we’ll keep it simple: boolean logic over terms.


Example: WestLaw http://www.westlaw.com/
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still
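
Worked example (incidence vectors). The query from the Term-Document Matrix slide, Brutus AND Caesar but NOT Calpurnia, can be sketched with Python integers standing in for the 0/1 vectors. This is only an illustration of the bitwise AND described on the Incidence Vectors slide. The names PLAYS, INCIDENCE, answer_query and matching_plays are made up for this sketch, and the leftmost bit is assumed to correspond to the first play in the matrix.

```python
# Illustrative sketch only; all names here are hypothetical, not from the lecture.

# Column order of the term-document matrix (leftmost bit = first play).
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One 0/1 incidence vector per term, copied from the matrix rows.
INCIDENCE = {
    "brutus": 0b110100,
    "caesar": 0b110111,
    "calpurnia": 0b010000,
}


def answer_query():
    """Brutus AND Caesar AND NOT Calpurnia as a bitwise AND."""
    mask = (1 << len(PLAYS)) - 1                  # keep only len(PLAYS) bits
    not_calpurnia = ~INCIDENCE["calpurnia"] & mask  # complemented vector: 101111
    return INCIDENCE["brutus"] & INCIDENCE["caesar"] & not_calpurnia


def matching_plays(bits):
    """Translate a result vector back into play titles."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]
```

Calling matching_plays(answer_query()) gives the two plays named on the slide. The mask is needed because Python integers are unbounded, so a plain ~ would otherwise produce a negative number rather than a 6-bit complement.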
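
Worked example (indexer steps). The steps on the Indexer steps slides (generate (term, doc) pairs, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched in a few lines of Python for the two example documents. This is a toy, in-memory sketch and not how a large collection would be indexed. The tokenizer is a crude assumption (whitespace split, trailing punctuation stripped, everything lowercased, so the slide's "I" becomes "i"), and the names DOCS, tokenize and build_index are hypothetical.

```python
# Toy sketch of sort-based index construction; names are hypothetical.
from itertools import groupby

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Assumed tokenizer: split on whitespace, strip . , ; and lowercase."""
    tokens = (raw.strip(".,;").lower() for raw in text.split())
    return [tok for tok in tokens if tok]


def build_index(docs):
    # Step 1: stream of (term, doc_id) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in tokenize(text)]
    # Step 2: sort by term, then by doc_id (the core indexing step).
    pairs.sort()
    # Steps 3 and 4: merge duplicates into (doc_id, term freq) postings and
    # record (number of docs, collection frequency) in the dictionary.
    dictionary, postings = {}, {}
    for term, term_group in groupby(pairs, key=lambda p: p[0]):
        plist = [(doc_id, len(list(same_doc)))
                 for doc_id, same_doc in groupby(term_group, key=lambda p: p[1])]
        postings[term] = plist
        dictionary[term] = (len(plist), sum(tf for _, tf in plist))
    return dictionary, postings
```

For example, after dictionary, postings = build_index(DOCS), the entry dictionary["caesar"] holds the document count and collection frequency, and postings["caesar"] holds one (doc #, freq) pair per document, sorted by docID.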