Stanford CS 276 - Lecture 5: Index Compression

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 5: Index Compression

Course work
- Problem set 1 due Thursday
- Programming exercise 1 will be handed out today

Last lecture: index construction
- Sort-based indexing
  - Naive in-memory inversion
  - Blocked Sort-Based Indexing: merge sort is effective for disk-based sorting (avoid seeks)
- Single-Pass In-Memory Indexing
  - No global dictionary: generate a separate dictionary for each block
  - Don't sort postings: accumulate postings in postings lists as they occur
- Distributed indexing using MapReduce
- Dynamic indexing: multiple indices, logarithmic merge

Today (Ch. 5)
- Collection statistics in more detail (with RCV1): how big will the dictionary and postings be?
- Dictionary compression
- Postings compression

Why compression, in general? (Ch. 5)
- Use less disk space: saves a little money
- Keep more stuff in memory: increases speed
- Increase speed of data transfer from disk to memory: [read compressed data, then decompress] is faster than [read uncompressed data]
  - Premise: decompression algorithms are fast. True of the decompression algorithms we use.

Why compression for inverted indexes? (Ch. 5)
- Dictionary
  - Make it small enough to keep in main memory
  - Make it so small that you can keep some postings lists in main memory too
- Postings file(s)
  - Reduce disk space needed
  - Decrease time needed to read postings lists from disk
  - Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory
- We will devise various IR-specific compression schemes

Recall Reuters RCV1 (Sec. 5.1)

  symbol  statistic                                       value
  N       documents                                       800,000
  L       avg. # tokens per document                      200
  M       terms (= word types)                            ~400,000
          avg. # bytes per token (incl. spaces/punct.)    6
          avg. # bytes per token (without spaces/punct.)  4.5
          avg. # bytes per term                           7.5
          non-positional postings                        100,000,000

  (A back-of-the-envelope size check based on these figures appears below, after the lossless/lossy slide.)

Index parameters vs. what we index (Sec. 5.1; details in IIR Table 5.1, p. 80); sizes in thousands (K)

                  dictionary            non-positional index     positional index
                  size K  delta%  cum%  size K    delta%  cum%   size K    delta%  cum%
  unfiltered      484                   109,971                  197,879
  no numbers      474     -2      -2    100,680   -8      -8     179,158   -9      -9
  case folding    392     -17     -19    96,969   -3      -12    179,158   -0      -9
  30 stopwords    391     -0      -19    83,390   -14     -24    121,858   -31     -38
  150 stopwords   391     -0      -19    67,002   -30     -39     94,517   -47     -52
  stemming        322     -17     -33    63,812   -4      -42     94,517   -0      -52

  Exercise: give intuitions for all the "0" entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression (Sec. 5.1)
- Lossless compression: all information is preserved. What we mostly do in IR.
- Lossy compression: discard some information.
- Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
- Lecture 7: prune postings entries that are unlikely to turn up in the top-k list for any query. Almost no loss of quality for the top-k list.
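To make the RCV1 statistics above concrete, here is a minimal back-of-the-envelope sketch. The 4-bytes-per-docID posting size is an assumption for illustration (a plain 32-bit integer per posting), not a figure from the slides.

```python
# Rough sizes for Reuters RCV1, from the statistics table above.
N = 800_000             # documents
L = 200                 # avg. tokens per document
M = 400_000             # terms (word types)
BYTES_PER_TOKEN = 6     # incl. spaces/punctuation
BYTES_PER_TERM = 7.5
POSTINGS = 100_000_000  # non-positional postings

text_size = N * L * BYTES_PER_TOKEN  # raw text: ~960 MB
dict_strings = M * BYTES_PER_TERM    # term strings alone: ~3 MB
postings_size = POSTINGS * 4         # assumed 4-byte docIDs: ~400 MB

print(f"text     ~{text_size / 1e6:.0f} MB")
print(f"terms    ~{dict_strings / 1e6:.0f} MB")
print(f"postings ~{postings_size / 1e6:.0f} MB")
```

Note that the collection has N x L = 160 million tokens but only 100 million non-positional postings: repeated occurrences of a term within one document collapse into a single posting.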
Vocabulary vs. collection size (Sec. 5.1)
- How big is the term vocabulary? That is, how many distinct words are there?
- Can we assume an upper bound? Not really: there are at least 70^20 ≈ 10^37 different words of length 20.
- In practice, the vocabulary will keep growing with the collection size, especially with Unicode.

Heaps' law: M = k T^b (Sec. 5.1)
- M is the size of the vocabulary; T is the number of tokens in the collection.
- Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5.
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about 1/2.
- It is the simplest possible relationship between the two in log-log space.
- An empirical finding ("empirical law").

Heaps' law for RCV1 (Sec. 5.1; Fig. 5.1, p. 81)
- For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
- Good empirical fit for Reuters RCV1: for the first 1,000,020 tokens, the law predicts 38,323 terms; there are actually 38,365 terms.

Exercises (Sec. 5.1)
- What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps' law?
- Compute the vocabulary size M for this scenario: looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. Assume a search engine indexes a total of 20,000,000,000 (2 x 10^10) pages, containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law? (A worked sketch appears at the end of these notes.)

Zipf's law (Sec. 5.1)
- Heaps' law gives the vocabulary size in collections.
- We also study the relative frequencies of terms.
- In natural language, there are a few very frequent terms and very many very rare terms.
- Zipf's law: the ith most frequent term has frequency proportional to 1/i, i.e. cf_i ∝ 1/i = K/i, where K is a normalizing constant.
- cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.

Zipf consequences (Sec. 5.1)
- If the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) occurs cf_1/2 times, the third most frequent term (and) occurs cf_1/3 times, and so on.
- Equivalently, cf_i = K/i where K is a normalizing factor, so log cf_i = log K - log i.
- Linear relationship between log cf_i and log i: another power-law relationship. (A small numeric check appears at the end of these notes.)

Zipf's law for Reuters RCV1 (Sec. 5.1)
[Figure: log-log plot of term frequency vs. frequency rank for RCV1]

Compression (Ch. 5)
- Now we will consider compressing the space for the dictionary and postings.
- Basic Boolean index only.
- No study of positional indexes, etc.
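As a companion to the Heaps' law exercise above, here is one possible worked sketch (an illustration, not an official answer key):

```python
import math

# Heaps' law: M = k * T^b. Fit k and b from the two measurements in the
# exercise: 3,000 terms after 10,000 tokens; 30,000 terms after 1,000,000 tokens.
b = math.log(30_000 / 3_000) / math.log(1_000_000 / 10_000)  # b = 0.5
k = 3_000 / 10_000 ** b                                      # k = 30.0

# The indexed collection: 2e10 pages * 200 tokens/page = 4e12 tokens.
T = 20_000_000_000 * 200
M = k * T ** b
print(b, k, f"{M:,.0f}")  # -> 0.5 30.0 60,000,000
```

So with k = 30 and b = 0.5, Heaps' law predicts a vocabulary of about 60 million terms for the indexed collection.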
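Similarly, a minimal numeric check of the Zipf consequence above; K here is an arbitrary illustrative constant, not an RCV1 value:

```python
import math

# With cf_i = K / i, the points (log i, log cf_i) lie on a line of slope -1,
# so log(cf_i) + log(i) = log(K) is the same for every rank i.
K = 1_000_000
for i in (1, 2, 3, 10, 100):
    cf = K / i
    print(i, int(cf), round(math.log10(cf) + math.log10(i), 6))  # last column is constant: 6.0
```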

