Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 5: Index Compression

Contents: Course work · Last lecture – index construction · Today · Why compression (in general)? · Why compression for inverted indexes? · Recall Reuters RCV1 · Index parameters vs. what we index · Lossless vs. lossy compression · Vocabulary vs. collection size · Heaps' Law · Exercises · Zipf's law · Zipf consequences · Zipf's law for Reuters RCV1 · Compression · Dictionary Compression · Why compress the dictionary? · Dictionary storage – first cut · Fixed-width terms are wasteful · Compressing the term list: Dictionary-as-a-String · Space for dictionary as a string · Blocking · Net · Exercise · Dictionary search without blocking · Dictionary search with blocking · Front coding · RCV1 dictionary compression summary · Postings compression · Postings: two conflicting forces · Postings file entry · Three postings entries · Variable length encoding · Variable Byte (VB) codes · Example · Other variable unit codes · Unary code · Gamma codes · Gamma code examples · Gamma code properties · Gamma seldom used in practice · RCV1 compression · Index compression summary · Resources for today's lecture

Course work
- Problem set 1 due Thursday
- Programming exercise 1 will be handed out today

Last lecture – index construction
- Sort-based indexing
  - Naïve in-memory inversion
  - Blocked Sort-Based Indexing
    - Merge sort is effective for disk-based sorting (avoid seeks!)
- Single-Pass In-Memory Indexing
  - No global dictionary: generate a separate dictionary for each block
  - Don't sort postings: accumulate postings in postings lists as they occur
- Distributed indexing using MapReduce
- Dynamic indexing: multiple indices, logarithmic merge

Today (Ch. 5)
- Collection statistics in more detail (with RCV1)
  - How big will the dictionary and postings be?
- Dictionary compression
- Postings compression

Why compression (in general)? (Ch. 5)
- Use less disk space
  - Saves a little money
- Keep more stuff in memory
  - Increases speed
- Increase speed of data transfer from disk to memory
  - [read compressed data | decompress] is faster than [read uncompressed data]
  - Premise: decompression algorithms are fast
    - True of the decompression algorithms we use

Why compression for inverted indexes? (Ch. 5)
- Dictionary
  - Make it small enough to keep in main memory
  - Make it so small that you can keep some postings lists in main memory too
- Postings file(s)
  - Reduce disk space needed
  - Decrease time needed to read postings lists from disk
  - Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory
- We will devise various IR-specific compression schemes

Recall Reuters RCV1 (Sec. 5.1)

  symbol  statistic                                       value
  N       documents                                       800,000
  L       avg. # tokens per doc                           200
  M       terms (= word types)                            ~400,000
          avg. # bytes per token (incl. spaces/punct.)    6
          avg. # bytes per token (without spaces/punct.)  4.5
          avg. # bytes per term                           7.5
          non-positional postings                         100,000,000
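As a quick sanity check, these statistics combine into the familiar back-of-the-envelope sizes (a minimal Python sketch; only the arithmetic is new, and the 4-byte docID used for the uncompressed-postings estimate is an assumption for illustration, not a slide figure):

```python
# Back-of-the-envelope sizes for RCV1 from the statistics above.
N_DOCS = 800_000
TOKENS_PER_DOC = 200
BYTES_PER_TOKEN = 6           # including spaces/punctuation
N_TERMS = 400_000
BYTES_PER_TERM = 7.5
N_POSTINGS = 100_000_000      # non-positional postings

collection = N_DOCS * TOKENS_PER_DOC * BYTES_PER_TOKEN  # raw text size
terms = N_TERMS * BYTES_PER_TERM                         # term strings only
postings = N_POSTINGS * 4                                # assumed 4-byte docIDs

print(f"raw collection: {collection / 1e9:.2f} GB")  # 0.96 GB
print(f"term strings:   {terms / 1e6:.1f} MB")       # 3.0 MB
print(f"postings:       {postings / 1e6:.0f} MB")    # 400 MB
```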
Index parameters vs. what we index (details: IIR Table 5.1, p. 80) (Sec. 5.1)

                  dictionary             non-positional index     positional index
                  (word types/terms)     (postings)               (postings)
                  size (K)  ∆%   cum %   size (K)  ∆%   cum %     size (K)  ∆%   cum %
  unfiltered        484      –     –     109,971    –     –       197,879    –     –
  no numbers        474     -2    -2     100,680   -8    -8       179,158   -9    -9
  case folding      392    -17   -19      96,969   -3   -12       179,158    0    -9
  30 stopwords      391     -0   -19      83,390  -14   -24       121,858  -31   -38
  150 stopwords     391     -0   -19      67,002  -30   -39        94,517  -47   -52
  stemming          322    -17   -33      63,812   -4   -42        94,517    0   -52

- Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression (Sec. 5.1)
- Lossless compression: all information is preserved
  - What we mostly do in IR
- Lossy compression: discard some information
  - Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination
  - Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top-k list for any query
    - Almost no loss of quality for the top-k list

Vocabulary vs. collection size (Sec. 5.1)
- How big is the term vocabulary?
  - That is, how many distinct words are there?
- Can we assume an upper bound?
  - Not really: at least 70^20 ≈ 10^37 different words of length 20
- In practice, the vocabulary will keep growing with the collection size
  - Especially with Unicode

Vocabulary vs. collection size (Sec. 5.1)
- Heaps' law: M = kT^b
  - M is the size of the vocabulary, T is the number of tokens in the collection
- Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
  - It is the simplest possible relationship between the two in log-log space
- An empirical finding ("empirical law")

Heaps' Law (Fig. 5.1, p. 81; Sec. 5.1)
- For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit
- Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49
- Good empirical fit for Reuters RCV1!
- For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms

Exercises (Sec. 5.1)
- What is the effect of including spelling errors, vs. automatically correcting spelling errors, on Heaps' law?
- Compute the vocabulary size M for this scenario:
  - Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
  - Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, each containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
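The RCV1 prediction above is easy to reproduce by plugging the fitted constants into Heaps' law (a minimal Python sketch; all numbers come from the Heaps' Law slide):

```python
# Heaps' law with the RCV1 least-squares fit: M = k * T**b.
k, b = 44, 0.49            # k = 10**1.64 rounded, b from the fit
T = 1_000_020              # first ~1M tokens of RCV1
M = k * T ** b
print(f"predicted vocabulary: {M:,.0f} terms")  # 38,323 (observed: 38,365)
```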
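And a sketch of the second exercise, in the same spirit: the two observations pin down k and b, which then extrapolate to the full collection (assuming, as above, 2 × 10^10 pages of 200 tokens each):

```python
import math

# Two observations of M = k * T**b from the exercise.
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

b = math.log(M2 / M1) / math.log(T2 / T1)  # log(10) / log(100) = 0.5
k = M1 / T1 ** b                           # 3000 / 10000**0.5 = 30

# Full collection: 2e10 pages * 200 tokens/page = 4e12 tokens.
T = 20_000_000_000 * 200
M = k * T ** b
print(f"b = {b:.2f}, k = {k:.0f}, predicted M = {M:,.0f}")
# b = 0.50, k = 30, predicted M = 60,000,000
```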