Stanford CS 276 - Index Compression

This preview shows page 1-2-3-23-24-25-26-46-47-48 out of 48 pages.



Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 5: Index Compression

Course work
  Problem set 1 due Thursday
  Programming exercise 1 will be handed out today

Last lecture: index construction
  Sort-based indexing
    Naïve in-memory inversion
    Blocked Sort-Based Indexing
      Merge sort is effective for disk-based sorting (avoid seeks!)
  Single-Pass In-Memory Indexing
    No global dictionary
      Generate separate dictionary for each block
    Don't sort postings
      Accumulate postings in postings lists as they occur
  Distributed indexing using MapReduce
  Dynamic indexing: Multiple indices, logarithmic merge

Today (Ch. 5)
  Collection statistics in more detail (with RCV1)
    How big will the dictionary and postings be?
  Dictionary compression
  Postings compression

Why compression (in general)? (Ch. 5)
  Use less disk space
    Saves a little money
  Keep more stuff in memory
    Increases speed
  Increase speed of data transfer from disk to memory
    [read compressed data | decompress] is faster than [read uncompressed data]
    Premise: Decompression algorithms are fast
      True of the decompression algorithms we use

Why compression for inverted indexes? (Ch. 5)
  Dictionary
    Make it small enough to keep in main memory
    Make it so small that you can keep some postings lists in main memory too
  Postings file(s)
    Reduce disk space needed
    Decrease time needed to read postings lists from disk
    Large search engines keep a significant part of the postings in memory
      Compression lets you keep more in memory
  We will devise various IR-specific compression schemes

Recall Reuters RCV1 (Sec. 5.1)

  symbol  statistic                                           value
  N       documents                                           800,000
  L       avg. # tokens per doc                               200
  M       terms (= word types)                                400,000
          avg. # bytes per token (incl. spaces/punct.)        6
          avg. # bytes per token (without spaces/punct.)      4.5
          avg. # bytes per term                               7.5
          non-positional postings                             100,000,000

Index parameters vs. what we index (details: IIR Table 5.1, p. 80) (Sec. 5.1)

                   word types (terms)        non-positional postings    positional postings
                   (dictionary)              (non-positional index)     (positional index)
                   Size (K)  Δ%   cumul %    Size (K)  Δ%   cumul %     Size (K)  Δ%   cumul %
  Unfiltered       484                       109,971                    197,879
  No numbers       474       -2   -2         100,680   -8   -8          179,158   -9   -9
  Case folding     392       -17  -19        96,969    -3   -12         179,158   -0   -9
  30 stopwords     391       -0   -19        83,390    -14  -24         121,858   -31  -38
  150 stopwords    391       -0   -19        67,002    -30  -39         94,517    -47  -52
  stemming         322       -17  -33        63,812    -4   -42         94,517    -0   -52

  Exercise: give intuitions for all the "0" entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression (Sec. 5.1)
  Lossless compression: All information is preserved.
    What we mostly do in IR.
  Lossy compression: Discard some information
  Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
  Chap/Lecture 7: Prune postings entries that are unlikely to turn up in the top k list for any query.
    Almost no loss of quality for top k list.

Vocabulary vs. collection size (Sec. 5.1)
  How big is the term vocabulary?
    That is, how many distinct words are there?
  Can we assume an upper bound?
    Not really: At least $70^{20} \approx 10^{37}$ different words of length 20
  In practice, the vocabulary will keep growing with the collection size
    Especially with Unicode

Vocabulary vs. collection size (Sec. 5.1)
  Heaps' law: $M = kT^b$
  M is the size of the vocabulary, T is the number of tokens in the collection
  Typical values: $30 \le k \le 100$ and $b \approx 0.5$
  In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about 1/2
    It is the simplest possible relationship between the two in log-log space
    An empirical finding ("empirical law")

Heaps' Law (Fig. 5.1, p. 81) (Sec. 5.1)
  [figure: log-log plot of vocabulary size M vs. number of tokens T for RCV1]
  For RCV1, the dashed line $\log_{10} M = 0.49 \log_{10} T + 1.64$ is the best least squares fit.
  Thus, $M = 10^{1.64} T^{0.49}$, so $k = 10^{1.64} \approx 44$ and $b = 0.49$.
  Good empirical fit for Reuters RCV1!
  For first 1,000,020 tokens, law predicts 38,323 terms; actually, 38,365 terms

Exercises (Sec. 5.1)
  What is the effect of including spelling errors, vs. automatically correcting spelling errors, on Heaps' law?
  Compute the vocabulary size M for this scenario:
    Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
    Assume a search engine indexes a total of 20,000,000,000 ($2 \times 10^{10}$) pages, containing 200 tokens on average
    What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?

Zipf's law (Sec. 5.1)
  Heaps' law gives the vocabulary size in collections.
  We also study the relative frequencies of terms.
  In natural language, there are a few very frequent terms and very many very rare terms.
  Zipf's law: The ith most frequent term has frequency proportional to 1/i.
  $cf_i \propto 1/i = K/i$ where K is a normalizing constant
  $cf_i$ is collection frequency: the number of occurrences of the term $t_i$ in the collection.

Zipf consequences (Sec. 5.1)
  If the most frequent term (the) occurs $cf_1$ times
    then the second most frequent term (of) occurs $cf_1/2$ times
    the third most frequent term (and) occurs $cf_1/3$ times ...
  Equivalent: $cf_i = K/i$ where K is a normalizing factor, so
    $\log cf_i = \log K - \log i$
    Linear relationship between $\log cf_i$ and $\log i$
  Another power law relationship

Zipf's law for Reuters RCV1 (Sec. 5.1)
  [figure]

Compression (Ch. 5)
  Now, we will consider compressing the space for the dictionary and postings
    Basic Boolean index only
    No study of positional indexes, etc.
    We will consider compression schemes

DICTIONARY COMPRESSION (Sec. 5.2)

Why compress the dictionary? (Sec. 5.2)
  Search begins with the dictionary
  We want to keep it in memory
  Memory footprint competition with other applications
  Embedded/mobile devices may have very little memory
  Even …
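
As a quick numerical check of the Heaps' law fit quoted for RCV1 in Sec. 5.1, the short Python sketch below turns the log-log line into k and b and then evaluates M = kT^b. The function names heaps_vocabulary and heaps_from_loglog_fit are hypothetical helpers made up for this illustration, not part of the lecture or of any library. It assumes the rounded value k = 44 given on the slide, which is what reproduces the quoted prediction.

```python
def heaps_from_loglog_fit(slope, intercept):
    """Convert a fitted line log10(M) = slope * log10(T) + intercept
    into Heaps' law parameters (k, b) for M = k * T**b."""
    k = 10 ** intercept
    b = slope
    return k, b


def heaps_vocabulary(num_tokens, k, b):
    """Heaps' law estimate of vocabulary size: M = k * T**b."""
    return k * num_tokens ** b


# Fit reported for RCV1: log10 M = 0.49 * log10 T + 1.64
k, b = heaps_from_loglog_fit(slope=0.49, intercept=1.64)
print(round(k))  # 44, the rounded k used on the slide

# Prediction for the first 1,000,020 tokens, using the rounded k = 44
# (assumption: the slide's figure was computed with the rounded k).
print(round(heaps_vocabulary(1_000_020, k=44, b=0.49)))  # 38323
```

Using the unrounded k = 10^1.64 gives a slightly lower estimate, so the choice of rounding matters when comparing against the observed 38,365 terms.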

