UT CS 337 - Compression - Other Lossless Compression Algorithms


Compression: Other Lossless Compression Algorithms
Greg Plaxton
Theory in Programming Practice, Spring 2005
Department of Computer Science, University of Texas at Austin

LZ78 (Lempel-Ziv)
• The encoder and decoder each maintain a “dictionary” containing certain words seen previously
– Initially the dictionary contains only the empty string (in practice it is often initialized to the set of single-symbol words)
– The algorithm maintains the invariant that the encoder and decoder dictionaries are the same (except the decoder dictionary can lag behind by a word)
– The encoder communicates a dictionary entry to the decoder by sending an integer index into the dictionary
– If the dictionary becomes full, a common strategy is to evict the LRU entry

LZ78: Outline of a Single Iteration
• Suppose the encoder has consumed some prefix of the input sequence
• The encoder now considers successively longer prefixes of the remaining input until it finds the first prefix αx such that α is a word in the dictionary and αx is not a word in the dictionary
• The word αx is added to the dictionary of the encoder
• The word αx is communicated to the decoder by transmitting the index i of α and the symbol x
• The decoder uses its dictionary to map i to α, and then adds the word αx to its dictionary

LZ78: Dictionary Data Structure
• It is common to implement the dictionary as a trie
– If the set of symbols is, e.g., the 256 possible bytes, then each node of the trie might have an array of length 256 to store its children
– While fast (linear time), this implementation is somewhat inefficient in terms of space
– A trick that can achieve a good space-time tradeoff is to store the children of a trie node in a linked list until the number of children is sufficiently large (say 10 or so), and then switch to an array
– Alternatively, the children of a trie node could be stored in a hash table
• The integers used to represent dictionary entries are indices into an array of pointers into the trie

LZ Algorithms
• Quite a few variations of LZ77 and LZ78 have been proposed
• The LZ algorithms are popular because they run in a single pass, provide good compression, are easy to code, and run quickly
• Used in popular compression utilities such as compress, gzip, and WinZip
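The single-iteration outline above translates almost directly into code. The following is a minimal sketch, assuming a plain Python dict keyed by the words themselves (rather than the trie described on the dictionary-data-structure slide) with index 0 reserved for the empty string; the names lz78_encode and lz78_decode and the list-of-pairs output format are illustrative choices, not part of the lecture.

```python
def lz78_encode(data):
    """Encode a string as a list of (index, symbol) pairs.

    The dictionary maps each previously seen word to its index;
    index 0 is reserved for the empty string.
    """
    dictionary = {"": 0}
    output = []
    i = 0
    while i < len(data):
        # Extend the current match one symbol at a time until the word
        # alpha + x is not yet in the dictionary.
        alpha = ""
        while i < len(data) and alpha + data[i] in dictionary:
            alpha += data[i]
            i += 1
        if i < len(data):
            x = data[i]
            i += 1
            dictionary[alpha + x] = len(dictionary)   # add the new word
            output.append((dictionary[alpha], x))     # send (index of alpha, x)
        else:
            # Input ended in the middle of a dictionary word; emit its
            # index with no extension symbol.
            output.append((dictionary[alpha], None))
    return output


def lz78_decode(pairs):
    """Invert lz78_encode: rebuild the dictionary as the pairs arrive."""
    words = [""]          # words[i] is the word with index i
    out = []
    for index, x in pairs:
        word = words[index] + (x if x is not None else "")
        out.append(word)
        if x is not None:
            words.append(word)
    return "".join(out)


print(lz78_encode("abababa"))   # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
assert lz78_decode(lz78_encode("abababa")) == "abababa"
```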
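The list-then-array trick for trie children on the dictionary-data-structure slide might look like the sketch below, over a byte alphabet. The threshold of 10 comes from the slide (“say 10 or so”); the class and field names are illustrative assumptions.

```python
SWITCH_THRESHOLD = 10   # "say 10 or so" from the slide

class TrieNode:
    """LZ78 dictionary trie node over a byte alphabet (symbols 0..255).

    Children live in a short list of (symbol, node) pairs until the node
    has more than SWITCH_THRESHOLD children, then in a 256-entry array.
    """

    def __init__(self, index):
        self.index = index        # index of the word this node represents
        self.sparse = []          # small list of (symbol, TrieNode) pairs
        self.dense = None         # 256-entry array once the node is "big"

    def child(self, symbol):
        if self.dense is not None:
            return self.dense[symbol]
        for s, node in self.sparse:
            if s == symbol:
                return node
        return None

    def add_child(self, symbol, node):
        if self.dense is None and len(self.sparse) >= SWITCH_THRESHOLD:
            # Switch representations: copy the pairs into a direct-indexed array.
            self.dense = [None] * 256
            for s, n in self.sparse:
                self.dense[s] = n
            self.sparse = []
        if self.dense is not None:
            self.dense[symbol] = node
        else:
            self.sparse.append((symbol, node))
```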
Arithmetic Coding
• Assume an i.i.d. source with alphabet A, where the ith symbol in A has associated probability pi, 1 ≤ i ≤ n = |A|
• Map each input string to a subinterval of the real interval [0, 1]
– Chop up the unit interval based on the first symbol of the string, with the ith symbol assigned to the subinterval [Σ_{1≤j<i} pj, Σ_{1≤j≤i} pj)
– Recursively construct the mapping within each subinterval to handle strings of length 2, then 3, et cetera
• The encoder specifies the real interval corresponding to the next fixed-size block of symbols to be sent

Arithmetic Coding: Specifying a Particular Interval
• To specify an interval, the encoder sends a (variable length) bit string that is itself interpreted as a subinterval of [0, 1]
– For example, 010 is interpreted as the interval containing all reals with binary expansion of the form .010∗∗∗…, where the ∗’s represent don’t cares (0 or 1)
– Thus 010 corresponds to [1/4, 3/8), 0 corresponds to [0, 1/2), 11 corresponds to [3/4, 1), et cetera
• Once the decoder has received a bit string that is entirely contained within an interval corresponding to a particular block, it outputs that block and proceeds to the next iteration

Arithmetic Coding: An Example
• Consider A = {a, b} where the probability associated with a is close to 1, e.g., 0.99
– The entropy per symbol is close to zero, so a direct application of Huffman coding performs poorly
– Even with a block size of 50, arithmetic coding communicates the all-a’s block using only a single bit, since 0.99^50 > 1/2

Run-Length Coding
• Another technique that is useful for dealing with certain low-entropy sources
• The basic idea is to encode a run of length k of the same symbol a as the pair (a, k)
• The resulting sequence of pairs is then typically coded using some other technique, e.g., Huffman coding
• Example: FAX protocols
– Run-length coding converts the document to alternating runs of white and black pixels
– Run lengths are encoded using a fixed Huffman code that works well on typical documents
– A long run such as 500 might be coded by passing Huffman codes for 128+, 128+, 128+, 64+, 52

Move-To-Front Coding
• A good technique for dealing with sources where the output favors certain symbols for a while, then favors another set of symbols, et cetera
• Keep the symbols in a list
• When a symbol is transmitted, move it to the head of the list
• Transmit a symbol by indicating its current position (index) in the list
• The hope is that we will mostly be sending small indices

Move-To-Front Coding: Compressing the Index Sequence
• The sequence of indices can be compressed using another method such as Huffman coding
• An easy alternative (though perhaps unlikely to give the best performance) is to encode each k-bit index using 2k − 1 bits as follows
– Assume the lowest index is 1; thus k > 0
– Send (k − 1) 0’s followed by the k-bit index
– The decoder counts the leading zeros to determine k, then decodes the k-bit index

Prediction by Partial Matching
• This is essentially the approach that Shannon used in his experiments with English text discussed in an earlier lecture
• The idea is to maintain, for each string α of some fixed length k, the conditional probability distribution for the symbol that follows the string α
• The encoder specifies the next symbol using some appropriate code, e.g., a Huffman code for the given probability distribution
• Shannon showed that for a wide class of discrete Markov sources, the performance of this technique approaches the entropy lower bound for k sufficiently large
– But in practice we cannot afford to use a value of k that is very large, since the number of separate probability distributions that must be maintained grows rapidly with k
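To make the interval arithmetic on the arithmetic-coding slides above concrete, the sketch below computes the subinterval of [0, 1) assigned to a block under an i.i.d. model and then finds a short bit string whose dyadic interval is contained in it. It uses exact Fraction arithmetic for clarity rather than the scaled integer arithmetic a practical coder would use; the function names and the bit-selection loop are illustrative choices, not the lecture’s algorithm.

```python
from fractions import Fraction

def block_interval(block, probs):
    """Return the [low, high) subinterval of [0, 1) assigned to `block`.

    `probs` maps each symbol to its probability; the unit interval is
    chopped in the iteration order of `probs`, mirroring the slides'
    subinterval [ sum_{j<i} pj , sum_{j<=i} pj ).
    """
    low, width = Fraction(0), Fraction(1)
    for symbol in block:
        cum = Fraction(0)
        for s, p in probs.items():
            if s == symbol:
                low, width = low + cum * width, Fraction(p) * width
                break
            cum += Fraction(p)
    return low, low + width


def interval_to_bits(low, high):
    """Return a bit string whose dyadic interval lies inside [low, high).

    Each emitted bit halves the current dyadic interval; always keeping
    the half that contains the midpoint of the target guarantees the
    loop terminates once the dyadic interval is small enough.
    """
    bits, lo, hi = "", Fraction(0), Fraction(1)
    target_mid = (low + high) / 2
    while not (low <= lo and hi <= high):
        mid = (lo + hi) / 2
        if target_mid < mid:
            hi, bits = mid, bits + "0"
        else:
            lo, bits = mid, bits + "1"
    return bits


# The example from the slides: with P(a) = 0.99, a block of 50 a's maps to
# the interval [0, 0.99^50), which contains [0, 1/2), so one bit suffices.
probs = {"a": Fraction(99, 100), "b": Fraction(1, 100)}
lo, hi = block_interval("a" * 50, probs)
print(interval_to_bits(lo, hi))   # prints "0"
```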

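A minimal run-length coder over a string of “pixels”, in the spirit of the run-length slide. The second stage (e.g., the fixed Huffman code over run lengths used by the FAX protocols) is not shown, and the white/black string below is an illustrative stand-in for a scanned line.

```python
from itertools import groupby

def rle_encode(data):
    """Turn a sequence into (symbol, run_length) pairs."""
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(symbol * length for symbol, length in pairs)

pixels = "w" * 500 + "b" * 3 + "w" * 12
print(rle_encode(pixels))   # [('w', 500), ('b', 3), ('w', 12)]
assert rle_decode(rle_encode(pixels)) == pixels
```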

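A sketch of move-to-front coding together with the 2k − 1-bit index code from the last move-to-front slide (k − 1 zeros followed by the k-bit binary representation of the index, i.e., the classical Elias gamma code). The starting alphabet order and the function names are illustrative assumptions.

```python
def mtf_encode(data, alphabet):
    """Move-to-front: emit 1-based indices into a list that keeps
    recently used symbols near the front."""
    table = list(alphabet)
    indices = []
    for symbol in data:
        i = table.index(symbol)          # 0-based position in the list
        indices.append(i + 1)            # transmit the 1-based index
        table.insert(0, table.pop(i))    # move the symbol to the head
    return indices

def mtf_decode(indices, alphabet):
    table = list(alphabet)
    out = []
    for i in indices:
        symbol = table.pop(i - 1)
        out.append(symbol)
        table.insert(0, symbol)
    return "".join(out)

def index_to_bits(i):
    """Encode index i >= 1 in 2k - 1 bits: (k - 1) zeros, then the
    k-bit binary representation of i (whose leading bit is 1)."""
    binary = bin(i)[2:]
    return "0" * (len(binary) - 1) + binary

indices = mtf_encode("bbbbaaab", "abcd")
print(indices)                                        # [2, 1, 1, 1, 2, 1, 1, 2]
print("".join(index_to_bits(i) for i in indices))     # 01011101011010
assert mtf_decode(indices, "abcd") == "bbbbaaab"
```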
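The modelling half of prediction by partial matching can be sketched as a table of per-context symbol counts, as below. This is only an illustrative sketch: it uses raw empirical counts with no smoothing, and it omits both the coding step (e.g., a per-context Huffman or arithmetic code) and the escape mechanism that practical PPM implementations use for unseen symbols; all names are hypothetical.

```python
from collections import defaultdict, Counter

def build_order_k_model(text, k):
    """For every length-k context seen in `text`, count which symbol follows it."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]
        counts[context][text[i]] += 1
    return counts

def conditional_distribution(counts, context):
    """Empirical conditional probabilities P(next symbol | context)."""
    total = sum(counts[context].values())
    return {s: c / total for s, c in counts[context].items()}

model = build_order_k_model("the theory in the thesis", k=2)
print(conditional_distribution(model, "th"))   # {'e': 1.0}
```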