U.C. Berkeley — CS170: Intro to CS Theory                         Handout N16
Professor Luca Trevisan                                       October 30, 2001

Notes for Lecture 16

1  The Lempel-Ziv algorithm

There is a sense in which Huffman coding was "optimal", but this is under several assumptions:

1. The compression is lossless, i.e. uncompressing the compressed file yields exactly the original file. When lossy compression is permitted, as for video, other algorithms can achieve much greater compression, and this is a very active area of research because people want to be able to send video and audio over the Web.

2. We know all the frequencies f(i) with which each character appears. How do we get this information? We could make two passes over the data, the first to compute the f(i), and the second to encode the file. But this can be much more expensive than passing over the data once for large files residing on disk or tape. One way to do just one pass over the data is to assume that the fractions f(i)/n of each character in the file are similar to those of files you have compressed before. For example, you could assume all Java programs (or English text, or PowerPoint files, or ...) have about the same fractions of characters appearing. A second, cleverer way is to estimate the fractions f(i)/n on the fly as you process the file. One can make Huffman coding adaptive this way.

3. We know the set of characters (the alphabet) appearing in the file. This may seem obvious, but there is a lot of freedom of choice. For example, the alphabet could be the characters on a keyboard, or it could be the keywords and variable names appearing in a program. To see what difference this can make, suppose we have a file consisting of n strings aaaa and n strings bbbb concatenated in some order. If we choose the alphabet {a, b} then 8n bits are needed to encode the file, but if we choose the alphabet {aaaa, bbbb} then only 2n bits are needed.

Picking the correct alphabet turns out to be crucial in practical compression algorithms. Both the UNIX compress and GNU gzip algorithms use a greedy algorithm due to Lempel and Ziv to compute a good alphabet in one pass while compressing. Here is how it works.

If s and t are two bit strings, we will use the notation s ◦ t to mean the bit string obtained by concatenating s and t.

We let F be the file we want to compress, and think of it just as a string of bits, that is, 0's and 1's. We will build an alphabet A of common bit strings encountered in F, and use it to compress F. Given A, we will break F into shorter bit strings like

    F = A(1) ◦ 0 ◦ A(2) ◦ 1 ◦ ··· ◦ A(7) ◦ 0 ◦ ··· ◦ A(5) ◦ 1 ◦ ··· ◦ A(i) ◦ j ◦ ···

and encode this by

    1 ◦ 0 ◦ 2 ◦ 1 ◦ ··· ◦ 7 ◦ 0 ◦ ··· ◦ 5 ◦ 1 ◦ ··· ◦ i ◦ j ◦ ···

    F = 0 | 00 | 1 | 11 | 10 | 101 | 1010 | 00    (the bars mark how F is broken up)

    A(1) = A(0) ◦ 0 = 0
    A(2) = A(1) ◦ 0 = 00
    A(3) = A(0) ◦ 1 = 1
    A(4) = A(3) ◦ 1 = 11
    A(5) = A(3) ◦ 0 = 10
    A(6) = A(5) ◦ 1 = 101    (the set A is now full)
           A(6) ◦ 0 = 1010
           A(1) ◦ 0 = 00

    Encoded F = (0,0), (1,0), (0,1), (3,1), (3,0), (5,1), (6,0), (1,0)
              = 0000 0010 0001 0111 0110 1011 1100 0010

    Figure 1: An example of the Lempel-Ziv algorithm.

The indices i of A(i) are in turn encoded as fixed-length binary integers, and the bits j are just bits. Given the fixed length (say r) of the binary integers, we decode by taking every group of r + 1 bits of a compressed file, using the first r bits to look up a string in A, and concatenating the last bit.
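Here is a minimal Python sketch of this decoding rule. It is an editorial illustration, not code from the notes: the function name is invented, the alphabet is passed in explicitly (playing the role of the header discussed next), and the example values r = 3 and A(0), ..., A(6) are read off Figure 1.

    def lz_decode(encoded, r, alphabet):
        # Decode as described above: read the compressed file in groups of
        # r+1 bits; the first r bits are the index k of a string A(k) in the
        # alphabet, and the last bit is appended literally after A(k).
        bits = encoded.replace(" ", "")        # spaces are only for readability
        pieces = []
        for pos in range(0, len(bits), r + 1):
            group = bits[pos:pos + r + 1]
            k = int(group[:r], 2)              # r-bit index of A(k)
            j = group[r]                       # the single literal bit
            pieces.append(alphabet[k] + j)     # this token decodes to A(k) followed by j
        return "".join(pieces)

    # The data of Figure 1: A(0) is the empty string, A(1)..A(6) as built there.
    A = ["", "0", "00", "1", "11", "10", "101"]
    encoded_F = "0000 0010 0001 0111 0110 1011 1100 0010"
    print(lz_decode(encoded_F, r=3, alphabet=A))   # -> 00011110101101000, the F of Figure 1

The decoder could instead rebuild A on the fly by mirroring the encoder's dictionary updates, which would avoid shipping a header; here A is simply taken as given, matching the header-based description in these notes.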
So when storing (or sending) an encoded file, a header containing A is also stored (or sent).

Notice that while Huffman's algorithm encodes blocks of fixed size into binary sequences of variable length, Lempel-Ziv encodes blocks of varying length into blocks of fixed size.

Here is the algorithm for encoding, including building A. Typically a fixed size is available for A, and once it fills up, the algorithm stops looking for new characters.

    A = {∅}     ... start with an alphabet containing only the empty string
    i = 0       ... points to the next place in the file F to start encoding
    repeat
        find the A(k) in the current alphabet that matches as many leading
            bits f_i f_{i+1} f_{i+2} ... of F as possible
            ... initially only A(0) = empty string matches
            ... let b be the number of bits in A(k)
        if A is not full, add A(k) ◦ f_{i+b} to A
            ... f_{i+b} is the first bit unmatched by A(k)
        output k ◦ f_{i+b}
        i = i + b + 1
    until i ≥ length(F)

Note that A is built "greedily", based on the beginning of the file. Thus there are no optimality guarantees for this algorithm. It can perform badly if the nature of the file changes substantially after A is filled up; however, the algorithm makes only one pass through the file. (There are other possible implementations: A may be unbounded, and the index k would then be encoded with a variable-length code itself.) A short Python sketch of this encoding loop is given at the end of these notes.

In Figure 1 there is an example of the algorithm running, where the alphabet A fills up after 6 characters are inserted. In this small example no compression is obtained, but if A were large, and the same long bit strings appeared frequently, compression would be substantial. The gzip manpage claims that source code and English text are typically compressed by 60%-70%.

To observe an example, we took a LaTeX file of 74,892 bytes. Running Huffman's algorithm, with bytes used as blocks, we could have compressed the file to 36,757 bytes, plus the space needed to specify the code. The Unix program compress produced an encoding of size 34,385 bytes, while gzip produced an encoding of size 22,815 bytes.

2  Lower bounds on data compression

2.1  Simple Results

How much can we compress a file without loss? We present some results that give lower bounds for any compression algorithm. Let us start from a "worst case" analysis.

Theorem 1
Let C : {0,1}^n → {0,1}^* be an encoding algorithm that allows lossless decoding (i.e. let C be an injective function mapping n bits into a sequence of bits). Then there is a file f ∈ {0,1}^n such that |C(f)| ≥ n.

In words, for any lossless compression algorithm there is always a file that the algorithm is unable to compress.

Proof: Suppose, by contradiction, that there is a compression algorithm C such that, for all f ∈ {0,1}^n, |C(f)| ≤ n − 1. Then the set {C(f) : f ∈ {0,1}^n} has 2^n elements because C is injective, but it is also a set of strings of length at most n − 1, and so it has at most Σ_{l=1}^{n−1} 2^l ≤ 2^n − 2 elements, which gives a contradiction. □

While the previous analysis showed the existence of incompressible files, the next theorem shows that random files are hard to compress, thus giving an "average case" analysis.

Theorem 2
Let C : {0,
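To close out Section 1, here is a minimal Python sketch of the encoding loop given there. It is an editorial illustration under stated assumptions, not code from the notes: the file F is taken as a Python string of '0'/'1' characters, the parameters r and capacity are chosen to match Figure 1, and the returned alphabet A plays the role of the header mentioned above.

    def lz_encode(F, r=3, capacity=7):
        # Greedy Lempel-Ziv encoding as in the pseudocode of Section 1.
        #   F        : the file, given as a string of '0'/'1' characters
        #   r        : number of bits used to write each index k
        #   capacity : maximum number of strings kept in A (Figure 1 stops
        #              adding after A(6), i.e. 7 entries including A(0))
        A = [""]                    # A(0) is the empty string
        tokens = []
        i, n = 0, len(F)
        while i < n:
            # Find the A(k) matching as many leading bits of F[i:] as possible,
            # but leave at least one bit unmatched so that f_{i+b} exists.
            k, b = 0, 0
            for idx, s in enumerate(A):
                if len(s) > b and len(s) < n - i and F.startswith(s, i):
                    k, b = idx, len(s)
            j = F[i + b]                            # first bit unmatched by A(k)
            if len(A) < capacity:
                A.append(A[k] + j)                  # add A(k) followed by j to the alphabet
            tokens.append(format(k, f"0{r}b") + j)  # output k and j as r+1 bits
            i = i + b + 1
        return " ".join(tokens), A

    # Running it on the F of Figure 1 reproduces the encoding shown there:
    encoded, A = lz_encode("00011110101101000")
    print(encoded)   # 0000 0010 0001 0111 0110 1011 1100 0010
    print(A)         # ['', '0', '00', '1', '11', '10', '101']

The one place this sketch goes beyond the pseudocode is the end of the file: the pseudocode assumes an unmatched bit f_{i+b} always exists, so the sketch never lets a match consume the last remaining bit. With Figure 1's input this reproduces the final token (1,0) shown there.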

