Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia CS347 Lecture 1 April 4 2001 Prabhakar Raghavan Term document incidence Incidence vectors Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 So we have a 0 1 vector for each term To answer query take the vectors for Brutus Caesar and Calpurnia complemented bitwise AND 110100 AND 110111 AND 101111 100100 1 if play contains word 0 otherwise 1 Answers to query Bigger corpora Antony and Cleopatra Act III Scene ii Agrippa Aside to DOMITIUS ENOBARBUS Why Enobarbus When Antony found Julius Caesar dead He cried almost to roaring and he we pt When at Philippi he found Brutus slain Consider n 1M documents each with about 1K terms Avg 6 bytes term incl spaces punctuation 6GB of data Hamlet Act III Scene ii Say there are m 500K distinct terms among these Lord Polonius I did enact Julius Caesar I was killed i the Capitol Brutus killed me Can t build the matrix 500K x 1M matrix has half a trillion 0 s and 1 s But it has no more than one billion 1 s matrix is extremely sparse What s a better representation Inverted index Documents are parsed to extract words and these are saved with the Document ID Why Doc 1 I did enact Julius Caesar I was killed i the Capitol Brutus killed me Doc 2 So let it be with Caesar The noble Brutus hath told you Caesar was ambitious Term I Doc 1 did enact julius caesar I was killed i the capitol brutus killed 1 1 1 1 1 1 1 1 1 1 1 1 me so let it be with caesar the noble brutus hath told 1 2 2 2 2 2 2 2 2 2 2 2 you 2 caesar 2 was ambitious 2 2 2 After all documents have been parsed the inverted file is sorted by terms Term Doc I did enact julius caesar I was killed i the capitol brutus killed me so let it be with caesar the noble brutus hath told you caesar was ambitious 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Term Doc ambitious be brutus brutus capitol caesar caesar caesar did enact hath I I i it julius killed killed let me noble so the the told you was was with The file is commonly split into a Dictionary and a Postings file Term Doc ambitious be brutus brutus capitol caesar caesar did enact hath I i it julius killed let me noble so the the told you was was with Freq 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 Multiple term entries in a single document are merged and frequency information added Term Doc ambitious be brutus brutus capitol caesar caesar caesar did enact hath I I i it julius killed killed let me noble so the the told you was was with 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 1 2 2 Term Doc ambitious be brutus brutus capitol caesar caesar did enact hath I i it julius killed let me noble so the the told you was was with Freq 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 Where do we pay in storage Doc Doc Term N docs Tot Freq ambitious 1 1 be 1 1 brutus 2 2 capitol 1 1 caesar 2 3 did 1 1 enact 1 1 hath 1 1 I 1 2 i 1 1 it 1 1 julius 1 1 killed 1 2 let 1 1 me 1 1 noble 1 1 so 1 1 the 2 2 told 1 1 you 1 1 was 2 2 with 1 1 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 1 2 2 Freq 2 2 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 2 1 2 1 1 1 1 2 1 2 1 2 1 1 1 2 1 2 2 1 1 1 1 2 1 2 2 1 1 1 1 Terms Term N docs Tot Freq ambitious 1 1 be 1 1 brutus 2 2 capitol 1 1 caesar 2 3 did 1 1 enact 1 1 hath 1 1 I 1 2 i 1 1 it 1 1 julius 1 1 killed 1 2 let 1 1 me 1 1 noble 1 1 so 1 1 the 2 2 told 1 1 you 1 1 was 2 2 with 1 1 Freq 2 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 2 1 1 2 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 2 2 1 2 2 1 1 1 1 1 Pointers 3 Two conflicting forces A term like Calpurnia occurs in maybe one doc out of a million would like to store this pointer using log2 1M 20 bits A term like the occurs in virtually every doc so 20 bits pointer is too expensive Prefer 0 1 vector in this case Variable encoding For Calpurnia use 20 bits gap entry For the use 1 bit gap entry If the average gap for a term is G want to use log2 G bits gap entry Postings file entry Store list of docs containing a term in increasing order of doc id Brutus 33 47 154 159 202 Consequence suffices to store gaps 33 14 107 5 43 Hope most gaps encoded with far fewer than 20 bits codes for gap encoding Length Offset Represent a gap G as the pair length offset length is in unary and uses log2 G 1 bits to specify the length of the binary encoding of offset G 2 log2G e g 9 represented as 1110001 Encoding G takes 2 log2G 1 bits 4 What we ve just done Encoded each gap as tightly as possible to within a factor of 2 For better tuning and …
View Full Document