Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 5: Index Compression

Contents: Course work · Last lecture – index construction · Today · Why compression (in general)? · Why compression for inverted indexes? · Recall Reuters RCV1 · Index parameters vs. what we index · Lossless vs. lossy compression · Vocabulary vs. collection size · Heaps' Law · Exercises · Zipf's law · Zipf consequences · Zipf's law for Reuters RCV1 · Compression · Dictionary Compression · Why compress the dictionary? · Dictionary storage – first cut · Fixed-width terms are wasteful · Compressing the term list: Dictionary-as-a-String · Space for dictionary as a string · Blocking · Net · Exercise · Dictionary search without blocking · Dictionary search with blocking · Front coding · RCV1 dictionary compression summary · Postings compression · Postings: two conflicting forces · Postings file entry · Three postings entries · Variable length encoding · Variable Byte (VB) codes · Example · Other variable unit codes · Unary code · Gamma codes · Gamma code examples · Gamma code properties · Gamma seldom used in practice · RCV1 compression · Index compression summary · Resources for today's lecture

Course work
- Problem set 1 due Thursday
- Programming exercise 1 will be handed out today

Last lecture – index construction
- Sort-based indexing
  - Naïve in-memory inversion
  - Blocked Sort-Based Indexing
    - Merge sort is effective for disk-based sorting (avoid seeks!)
- Single-Pass In-Memory Indexing
  - No global dictionary: generate a separate dictionary for each block
  - Don't sort postings: accumulate postings in postings lists as they occur
- Distributed indexing using MapReduce
- Dynamic indexing: multiple indices, logarithmic merge

Today (Ch. 5)
- Collection statistics in more detail (with RCV1)
  - How big will the dictionary and postings be?
- Dictionary compression
- Postings compression

Why compression (in general)? (Ch. 5)
- Use less disk space
  - Saves a little money
- Keep more stuff in memory
  - Increases speed
- Increase speed of data transfer from disk to memory
  - [read compressed data | decompress] is faster than [read uncompressed data]
  - Premise: decompression algorithms are fast
    - True of the decompression algorithms we use

Why compression for inverted indexes? (Ch. 5)
- Dictionary
  - Make it small enough to keep in main memory
  - Make it so small that you can keep some postings lists in main memory too
- Postings file(s)
  - Reduce disk space needed
  - Decrease time needed to read postings lists from disk
  - Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory
- We will devise various IR-specific compression schemes

Recall Reuters RCV1 (Sec. 5.1)

  symbol  statistic                                       value
  N       documents                                       800,000
  L       avg. # tokens per doc                           200
  M       terms (= word types)                            ~400,000
          avg. # bytes per token (incl. spaces/punct.)    6
          avg. # bytes per token (without spaces/punct.)  4.5
          avg. # bytes per term                           7.5
          non-positional postings                         100,000,000
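As a quick sanity check, these statistics combine into the familiar back-of-the-envelope sizes (a minimal Python sketch; only the arithmetic is new, and the 4-byte docID used for the uncompressed-postings estimate is an assumption for illustration, not a slide figure):

```python
# Back-of-the-envelope sizes for RCV1 from the statistics above.
N_DOCS = 800_000
TOKENS_PER_DOC = 200
BYTES_PER_TOKEN = 6           # including spaces/punctuation
N_TERMS = 400_000
BYTES_PER_TERM = 7.5
N_POSTINGS = 100_000_000      # non-positional postings

collection = N_DOCS * TOKENS_PER_DOC * BYTES_PER_TOKEN  # raw text size
terms = N_TERMS * BYTES_PER_TERM                         # term strings only
postings = N_POSTINGS * 4                                # assumed 4-byte docIDs

print(f"raw collection: {collection / 1e9:.2f} GB")  # 0.96 GB
print(f"term strings:   {terms / 1e6:.1f} MB")       # 3.0 MB
print(f"postings:       {postings / 1e6:.0f} MB")    # 400 MB
```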
Index parameters vs. what we index (details: IIR Table 5.1, p. 80) (Sec. 5.1)

                  dictionary             non-positional index     positional index
                  (word types/terms)     (postings)               (postings)
                  size (K)  ∆%   cum %   size (K)  ∆%   cum %     size (K)  ∆%   cum %
  unfiltered        484      –     –     109,971    –     –       197,879    –     –
  no numbers        474     -2    -2     100,680   -8    -8       179,158   -9    -9
  case folding      392    -17   -19      96,969   -3   -12       179,158    0    -9
  30 stopwords      391     -0   -19      83,390  -14   -24       121,858  -31   -38
  150 stopwords     391     -0   -19      67,002  -30   -39        94,517  -47   -52
  stemming          322    -17   -33      63,812   -4   -42        94,517    0   -52

- Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression (Sec. 5.1)
- Lossless compression: all information is preserved
  - What we mostly do in IR
- Lossy compression: discard some information
  - Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination
  - Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top-k list for any query
    - Almost no loss of quality for the top-k list

Vocabulary vs. collection size (Sec. 5.1)
- How big is the term vocabulary?
  - That is, how many distinct words are there?
- Can we assume an upper bound?
  - Not really: at least 70^20 ≈ 10^37 different words of length 20
- In practice, the vocabulary will keep growing with the collection size
  - Especially with Unicode

Vocabulary vs. collection size (Sec. 5.1)
- Heaps' law: M = kT^b
  - M is the size of the vocabulary, T is the number of tokens in the collection
- Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
  - It is the simplest possible relationship between the two in log-log space
- An empirical finding ("empirical law")

Heaps' Law (Fig. 5.1, p. 81; Sec. 5.1)
- For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit
- Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49
- Good empirical fit for Reuters RCV1!
- For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms

Exercises (Sec. 5.1)
- What is the effect of including spelling errors, vs. automatically correcting spelling errors, on Heaps' law?
- Compute the vocabulary size M for this scenario:
  - Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
  - Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, each containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
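The RCV1 prediction above is easy to reproduce by plugging the fitted constants into Heaps' law (a minimal Python sketch; all numbers come from the Heaps' Law slide):

```python
# Heaps' law with the RCV1 least-squares fit: M = k * T**b.
k, b = 44, 0.49            # k = 10**1.64 rounded, b from the fit
T = 1_000_020              # first ~1M tokens of RCV1
M = k * T ** b
print(f"predicted vocabulary: {M:,.0f} terms")  # 38,323 (observed: 38,365)
```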
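And a sketch of the second exercise, in the same spirit: the two observations pin down k and b, which then extrapolate to the full collection (assuming, as above, 2 × 10^10 pages of 200 tokens each):

```python
import math

# Two observations of M = k * T**b from the exercise.
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

b = math.log(M2 / M1) / math.log(T2 / T1)  # log(10) / log(100) = 0.5
k = M1 / T1 ** b                           # 3000 / 10000**0.5 = 30

# Full collection: 2e10 pages * 200 tokens/page = 4e12 tokens.
T = 20_000_000_000 * 200
M = k * T ** b
print(f"b = {b:.2f}, k = {k:.0f}, predicted M = {M:,.0f}")
# b = 0.50, k = 30, predicted M = 60,000,000
```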