CS347 Lecture 1 April 4 2001 Prabhakar Raghavan Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia Term document incidence Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 1 if play contains word 0 otherwise Incidence vectors So we have a 0 1 vector for each term To answer query take the vectors for Brutus Caesar and Calpurnia complemented bitwise AND 110100 AND 110111 AND 101111 100100 Answers to query Antony and Cleopatra Act III Scene ii Agrippa Aside to DOMITIUS ENOBARBUS Why Enobarbus When Antony found Julius Caesar dead He cried almost to roaring and he wept When at Philippi he found Brutus slain Hamlet Act III Scene ii Lord Polonius I did enact Julius Caesar I was killed i the Capitol Brutus killed me Bigger corpora Consider n 1M documents each with about 1K terms Avg 6 bytes term incl spaces punctuation 6GB of data Say there are m 500K distinct terms among these Can t build the matrix 500K x 1M matrix has half a trillion 0 s and 1 s But it has no more than one billion 1 s matrix is extremely sparse What s a better representation Why Inverted index Documents are parsed to extract words and these are saved with the Document ID Doc 1 I did enact Julius Caesar I was killed i the Capitol Brutus killed me Doc 2 So let it be with Caesar The noble Brutus hath told you Caesar was ambitious Term I did enact julius caesar I was killed i the capitol brutus killed me so let it be with caesar the noble brutus hath told you Doc 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 caesar 2 was ambitious 2 2 After all documents have been parsed the inverted file is sorted by terms Term Doc I did enact julius caesar I was killed i the capitol brutus killed me so let it be with caesar the noble brutus hath told you caesar was ambitious 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Term Doc ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 hath 1 I 1 I 1 i 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 Multiple term entries in a single document are merged and frequency information added Term Doc ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 hath 1 I 1 I 1 i 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 Term Doc ambitious be brutus brutus capitol caesar caesar did enact hath I i it julius killed let me noble so the the told you was was with Freq 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 The file is commonly split into a Dictionary and a Postings file Term Doc ambitious be brutus brutus capitol caesar caesar did enact hath I i it julius killed let me noble so the the told you was was with Freq 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 Doc Term N docs Tot Freq ambitious 1 1 be 1 1 brutus 2 2 capitol 1 1 caesar 2 3 did 1 1 enact 1 1 hath 1 1 I 1 2 i 1 1 it 1 1 julius 1 1 killed 1 2 let 1 1 me 1 1 noble 1 1 so 1 1 the 2 2 told 1 1 you 1 1 was 2 2 with 1 1 Freq 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 Where do we pay in storage Doc Terms Freq 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 Term N docs Tot Freq ambitious 1 1 be 1 1 brutus 2 2 capitol 1 1 caesar 2 3 did 1 1 enact 1 1 hath 1 1 I 1 2 i 1 1 it 1 1 julius 1 1 killed 1 2 let 1 1 me 1 1 noble 1 1 so 1 1 the 2 2 told 1 1 you 1 1 was 2 2 with 1 1 Pointers 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 Two conflicting forces A term like Calpurnia occurs in maybe one doc out of a million would like to store this pointer using log2 1M 20 bits A term like the occurs in virtually every doc so 20 bits pointer is too expensive Prefer 0 1 vector in this case Postings file entry Store list of docs containing a term in increasing order of doc id Brutus 33 47 154 159 202 Consequence suffices to store gaps 33 14 107 5 43 Hope most gaps encoded with far fewer than 20 bits Variable encoding For Calpurnia use 20 bits gap entry For the use 1 bit gap entry If the average gap for a term is G want to use log2G bits gap entry codes for gap encoding Length Offset Represent a gap G as the pair length offset length is in unary and uses log2G 1 bits to specify the length of the binary encoding of offset G 2 log2G e g 9 represented as 1110001 Encoding G takes 2 log2G 1 bits What we ve just done Encoded each gap as tightly as possible to within a factor of 2 For better tuning and a simple analysis need some handle on …
View Full Document