**Unformatted text preview:**

CS336: Intelligent Information RetrievalCompressionText Compression: Two classesCompression: Based on text modelSlide 5Stastic Statistical ModelsSemi-static Statistical ModelsAdaptive Statistical ModelsSlide 9Symbol-wise MethodsHuffman codingSlide 12Slide 13Dictionary MethodsSlide 15Ziv-LempelAccessing Compressed TextText CompressionInverted File CompressionFixed-Length CompressionExample CodeSlide 23Comparison of the text Compression techniquesCS336: Intelligent Information RetrievalLecture 6: CompressionCompression•Change the representation of a file so that–takes less space to store–less time to transmit•Text compression: original file can be reconstructed exactly •Data Compression: can tolerate small changes or noise–typically sound or images•already a digital approximation of an analog waveformText Compression: Two classes•Symbol-wise Methods–estimate the probability of occurrence of symbols–More accurate estimates => Greater compression•Dictionary Methods–represent several symbols as one codeword•compression performance based on length of codewords–replace words and other fragments of text with an index to an entry in a “dictionary”Compression: Based on text model•Determine output code based on pr distribution of model–# bits should use to encode a symbol equivalent to information content (H(s) = - log Pr(s))–Recall Information Theory and Entropy (H)•avg amount of information per symbol over an alphabet•H = (Pr(s) * I(s))–The more probable a symbol is, the less information it provides (easier to predict)–The less probable, the more information it providestextcompressed textencodermodeldecodermodelCompression: Based on text model•Symbol-wise methods (statistical)–Can assume independence assumption–Considering context achieves best compression rates•e.g. probability of seeing ‘u’ next is higher if we’ve just seen ‘q’–Must yield probability distribution such that the probability of the symbol actually occurring is very high•Dictionary methods–string representations typically based upon text seen so far•most chars coded as part of a string that has already occurred•Space is saved if the reference or pointer takes fewer bits than the string it replacestextcompressed textencodermodeldecodermodelStastic Statistical Models•Static: same model regardless of text processed–But, won’t be good for all inputs•Recall entropy: measures information content over collection of symbols–high pr(s) => low entropy; low pr(s) => high entropy–pr(s) = 1 => only s is possible, no encoding needed–pr(s) = 0 =>s won’t occur, so s can not be encoded–What is implication for models in which some pr(s) = 0, but s does in fact occur?iiippE21logs can not be encoded!Therefore in practice, all symbols must be assigned pr(s) > 0Semi-static Statistical Models•Generate model on fly for file being processed–first pass to estimate probabilities–probabilities transmitted to decoder before encoded transmission–disadvantage is having to transmit model firstAdaptive Statistical Models•Model constructed from the text just encoded–begin with general model, modify gradually as more text is seen•best models take into account context–decoder can generate same model at each time step since it has seen all characters up to that point–advantages: effective w/out fine-tuning to particular text and only 1 pass neededAdaptive Statistical Models•What about pr(0) problem?–allow 1 extra count divided evenly amongst unseen symbols•recall probabilities estimated via occurrence counts•assume 72,300 total occurrences, 5 unseen characters:•each unseen char, u, gets pr(u) = pr(char is one of the unseen) * pr(all unseen char) = 1/5 * 1/72301•Not good for full-text retrieval–must be decoded from beginning•therefore not good for random access to filesSymbol-wise Methods•Morse Code–common symbols assigned short codes (few bits)–rare symbols have longer codes•Huffman coding (prefix-free code)–no codeword is a prefix of another symbol’s codewordSymbol Codeword Probability a 0000 0.05 b 0001 0.05 c 001 0.1 d 01 0.2 e 10 0.3 f 110 0.2 g 111 0.1Huffman codingSymbol Codeword Probability a 0000 0.05 b 0001 0.05 c 001 0.1 d 01 0.2 e 10 0.3 f 110 0.2 g 111 0.1 •What is the string for 1010110111111110?•Decoder identifies 10 as first codeword => e•Decoding proceeds l to rt on remainder of string•Good when prob. distribution is static•Best when model is word-based•Typically used for compressing text in IReefggfSymbol Codeword Probability a 0000 0.05 b 0001 0.05 c 001 0.1 d 01 0.2 e 10 0.3 f 110 0.2 g 111 0.1 Create a decoding tree bottom-up1. Each symbol and pr are leafs2. 2 nodes w/ smallest p are joinedunder same parent (p = p(s1)+p(s2)3. Repeat, ignoring children, until allnodes are connected4. Label nodes•Left branch gets 0•Right branch 1 e 0.3 a 0.05 b 0.05 c 0.1 d 0.2 f 0.2 g 0.1 0.1 0.20.40.30.61.000001 1 111100•Requires 2 passes over the document1. gather statistics and build the coding table 2. encode the document•Must explicitly store coding table with document–eats into your space savings on short documents•Exploit only non-uniformity in symbol distribution– adaptive algorithms can recognize the higher-order redundancy in strings e.g. 0101010101.... Huffman codingDictionary Methods•Braille–special codes are used to represent whole words•Dictionary: list of sub-strings with corresponding code-words–we do this naturally e.g) replace “december” w/ “12”–codebook for ascii might contain •128 characters and 128 common letter pairs•At best each char encoded in 4 bits rather than 7–can be static, semi-static, or adaptive (best)Dictionary Methods•Ziv-Lempel–Built on the fly–Replace strings w/ reference to a previous occurrence –codebook•all words seen prior to current position•codewords represented by “pointers”–Compression: pointer stored in fewer bits than string–Especially useful when resources used must be minimized (e.g. 1 machine distributes data to many)•Relatively easy to implement•Decoding is fast•Requires only small amount of memoryZiv-Lempel•Coded output: series of triples <x,y,z>•x: how far back to look in previous decoded text to find next string•y: how long the string is•z: next character from the input–only necessary if char to

View Full Document