CU-Boulder CSCI 5417 - Lecture 2

This preview shows page 1-2-3-20-21-40-41-42 out of 42 pages.



CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 2
8/25/2011
01/15/19 CSCI 5417 - IR

Slide titles:
- CSCI 5417 Information Retrieval Systems Jim Martin
- Today 8/25
- Simple Unstructured Data Scenario
- Grepping is Not an Option
- Term-Document Matrix
- Incidence Vectors
- Answers to Query
- Bigger Collections
- The Matrix
- Inverted index
- Slide 11
- Index Creation
- Indexer steps
- Slide 15
- Slide 16
- Indexing
- Given an Index
- Example: WestLaw http://www.westlaw.com/
- Boolean queries: Exact match
- Query processing: AND
- The Merge (Intersection)
- Intersecting two postings lists (a “merge” algorithm)
- Query optimization
- Query optimization example
- More general optimization
- Break
- Assignment 1: Due 9/1
- Slide 29
- Terms Revisited
- Tokenization
- Slide 32
- Numbers
- Tokenization: Language issues
- Slide 35
- Tokenization: language issues
- Normalization
- Normalization: other languages
- Case folding
- Stop words
- Lemmatization
- Next time

Today 8/25
- Basic indexing, retrieval scenario
- Boolean query processing
- More on terms and tokens

Simple Unstructured Data Scenario
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- We could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. This is problematic:
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Lines vs. plays

Grepping is Not an Option
- So if we can’t search the documents in response to a query, what can we do?
- Create a data structure up front that will facilitate the kind of searching we want to do.

Term-Document Matrix
1 if play contains word, 0 otherwise

            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
mercy           1           0        1         1        1         1
worser          1           0        1         1        1         0

Query: Brutus AND Caesar but NOT Calpurnia

Incidence Vectors
- So we have a 0/1 vector for each term
- Length of the term vector = number of plays
- To answer our query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and then do a bitwise AND.
- 110100 AND 110111 AND 101111 = 100100
- That is, plays 1 and 4: “Antony and Cleopatra” and “Hamlet”

Answers to Query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Bigger Collections
- Consider N = 1M documents, each with about 1K terms.
- Avg 6 bytes/term including spaces and punctuation: 6GB of data just for the documents.
- Assume there are m = 500K distinct terms (types) among these.

The Matrix
- 500K x 1M matrix has 1/2 trillion entries
- But it has no more than one billion 1’s
- Matrix is extremely sparse.
- What’s the minimum number of 1’s in such an index?
- What’s a better representation?
  - Forget the 0’s. Only record the 1’s.
  - Why?

Inverted index
- For each term T, we must store a list of all documents that contain T.

Brutus      2  4  8  16  32  64  128
Calpurnia   1  2  3  5  8  13  21  34
Caesar      13  16

- What happens if the word Caesar is later added to document 14?

Inverted index
- Linked lists generally preferred to arrays
  - Dynamic space allocation
  - Insertion of terms into documents easy
  - But there is the space overhead of pointers

Dictionary    Postings lists (each entry is a posting)
Brutus        2  4  8  16  32  64  128
Calpurnia     1  2  3  5  8  13  21  34
Caesar        13  16

- Sorted by docID (more later on why).

Index Creation
1. Documents to be indexed: Friends, Romans, countrymen.
2. Tokenizer produces the token stream: Friends, Romans, Countrymen
3. Linguistic modules produce modified tokens: friend, roman, countryman
4. Indexer produces the inverted index:

friend        2  4
roman         1  2
countryman    13  16

Indexer steps
- From the documents generate a stream of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2

- Sort pairs by terms. This is the core indexing step.

Term        Doc #
ambitious   2
be          2
brutus      1
brutus      2
capitol     1
caesar      1
caesar      2
caesar      2
did         1
enact       1
hath        2
I           1
I           1
i'          1
it          2
julius      1
killed      1
killed      1
let         2
me          1
noble       2
so          2
the         1
the         2
told        2
you         2
was         1
was         2
with        2

- Multiple term entries in a single document are merged.
- Frequency information is added.

Term        Doc #   Term freq
ambitious   2       1
be          2       1
brutus      1       1
brutus      2       1
capitol     1       1
caesar      1       1
caesar      2       2
did         1       1
enact       1       1
hath        2       1
I           1       2
i'          1       1
it          2       1
julius      1       1
killed      1       2
let         2       1
me          1       1
noble       2       1
so          2       1
the         1       1
the         2       1
told        2       1
you         2       1
was         1       1
was         2       1
with        2       1

- The result is then split into a Dictionary file and a Postings file.

Dictionary file                     Postings file
Term        N docs   Coll freq      (Doc #, Freq)
ambitious   1        1              (2, 1)
be          1        1              (2, 1)
brutus      2        2              (1, 1) (2, 1)
capitol     1        1              (1, 1)
caesar      2        3              (1, 1) (2, 2)
did         1        1              (1, 1)
enact       1        1              (1, 1)
hath        1        1              (2, 1)
I           1        2              (1, 2)
i'          1        1              (1, 1)
it          1        1              (2, 1)
julius      1        1              (1, 1)
killed      1        2              (1, 2)
let         1        1              (2, 1)
me          1        1              (1, 1)
noble       1        1              (2, 1)
so          1        1              (2, 1)
the         2        2              (1, 1) (2, 1)
told        1        1              (2, 1)
you         1        1              (2, 1)
was         2        2              (1, 1) (2, 1)
with        1        1              (2, 1)

- Where’s the primary storage cost?
- Why split into two files?

Indexing
- Of course you wouldn’t really do it that way for large collections. Why?
- The indexer would be too slow

Given an Index
- So what is such an index good for?
  - Processing queries to get documents
- What’s a query?
  - An encoding of a user’s information need
- For now we’ll keep it simple: boolean logic over terms.

Example: WestLaw http://www.westlaw.com/
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still
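
The Incidence Vectors slide answers Brutus AND Caesar but NOT Calpurnia with a bitwise AND over 0/1 vectors. A minimal Python sketch of that computation follows. The names `PLAYS`, `INCIDENCE`, `matching_plays` and `brutus_and_caesar_not_calpurnia` are made up for illustration, and storing each vector as a bit string with the leftmost bit standing for the first play is an assumption of this sketch, not something the slides prescribe.

```python
# Plays in the column order of the term-document matrix slide.
PLAYS = [
    "Antony and Cleopatra",
    "Julius Caesar",
    "The Tempest",
    "Hamlet",
    "Othello",
    "Macbeth",
]

# Incidence vectors copied from the matrix rows (leftmost bit = first play).
INCIDENCE = {
    "Brutus": "110100",
    "Caesar": "110111",
    "Calpurnia": "010000",
}


def to_bits(vector):
    """Turn a 0/1 string into an integer so Python's bitwise operators apply."""
    return int(vector, 2)


def matching_plays(bits):
    """Return the plays whose bit is set, in column order."""
    n = len(PLAYS)
    return [play for i, play in enumerate(PLAYS) if bits & (1 << (n - 1 - i))]


def brutus_and_caesar_not_calpurnia():
    # Python integers are unbounded, so mask the complement down to one bit per play.
    mask = (1 << len(PLAYS)) - 1
    brutus = to_bits(INCIDENCE["Brutus"])
    caesar = to_bits(INCIDENCE["Caesar"])
    not_calpurnia = ~to_bits(INCIDENCE["Calpurnia"]) & mask
    return matching_plays(brutus & caesar & not_calpurnia)


result = brutus_and_caesar_not_calpurnia()
print(result)
```

The mask is the one step the slide leaves implicit: without it, complementing an unbounded integer would not yield the six-bit vector 101111.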

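The figures on the Bigger Collections and The Matrix slides can be re-derived with a few lines of arithmetic. The sketch below only restates the slide's numbers. The variable names are invented here, and it assumes 1 GB = 10^9 bytes and that each term occurrence can set at most one cell of the matrix to 1.

```python
# Figures taken from the slides.
N_DOCS = 1_000_000          # N = 1M documents
TERMS_PER_DOC = 1_000       # about 1K terms per document
BYTES_PER_TERM = 6          # average, including spaces and punctuation
N_DISTINCT_TERMS = 500_000  # m = 500K distinct terms

# Raw size of the collection in bytes.
collection_bytes = N_DOCS * TERMS_PER_DOC * BYTES_PER_TERM

# Number of cells in the term-document matrix.
matrix_cells = N_DISTINCT_TERMS * N_DOCS

# Upper bound on the number of 1's: every term occurrence sets at most one cell.
max_ones = N_DOCS * TERMS_PER_DOC

# Fraction of cells that can be non-zero at most.
max_density = max_ones / matrix_cells

print(collection_bytes / 10**9, "GB")
print(matrix_cells, "cells,", max_ones, "ones at most")
print(max_density)
```

Under these assumptions at most 0.2% of the cells hold a 1, which is the sense in which the slide calls the matrix extremely sparse.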

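The indexer steps (generate pairs, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched in Python on the two example documents. This is an illustration under stated assumptions, not the course's indexer. `tokenize` and `build_index` are hypothetical names, and the tokenizer simply lowercases and keeps runs of letters and apostrophes, so "I" becomes "i", whereas the slides keep it capitalized.

```python
import re
from collections import Counter, defaultdict

# The two example documents from the Indexer steps slide.
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Assumed normalization: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    # Step 1: stream of (modified token, document ID) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

    # Step 2: sort pairs by term, then by document ID.
    pairs.sort()

    # Step 3: merge repeated pairs and record the term frequency per document.
    term_freqs = Counter(pairs)

    # Step 4: split into postings (term -> [(doc ID, freq), ...]) and
    # dictionary (term -> (number of docs, collection frequency)).
    postings = defaultdict(list)
    for (term, doc_id), freq in sorted(term_freqs.items()):
        postings[term].append((doc_id, freq))
    dictionary = {
        term: (len(plist), sum(freq for _, freq in plist))
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)


dictionary, postings = build_index(DOCS)
print(dictionary["caesar"], postings["caesar"])
```

Because the pairs are sorted before merging, each postings list comes out ordered by document ID, matching the "Sorted by docID" note on the Inverted index slide.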