MIT 6 441 - LEMPEL-ZIV SLIDING WINDOW UNIVERSAL COMPRESSION - D1930416

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 441> LEMPEL-ZIV SLIDING WINDOW UNIVERSAL COMPRESSION

MIT 6 441 - LEMPEL-ZIV SLIDING WINDOW UNIVERSAL COMPRESSION

School name Massachusetts Institute of Technology

Course 6 441- Information Theory

Pages 13

Download Save

Unformatted text preview:

LEMPEL-ZIV SLIDING WINDOW UNIVERSAL COMPRESSION 6.441 Supplementary Notes 2b, Revised 3/14/94 INTRODUCTION: In the 70's, there were a number of attempts to develop data compression theories and techniques that could deal with sources with unknown statistics. The most successful of these (both from the standpoint of practice and theory) were two classes of algorithms developed by Jacob Ziv and Abraham Lempel [ZL77, ZL78]. The first scheme, often called LZ77, was a string matching sliding window scheme; the second, called LZ78, was an adaptive dictionary scheme. The second scheme was implemented many years ago in the UNIX compress algorithm and many other places. Implementations of the first scheme are more recent (Stacker and Microsoft MS-DOS-6); they are viewed as compressing more efficiently, but are more computationally intensive. A new paper, [WZ93], about LZ77 not only analyzes the string matching algorithm but also provides great insight into why universal algorithms work. This note follows [WZ93] closely to explain these issues. A universal data compression algorithm is an algorithm that runs without any knowledge of the source other than the alphabet. Thus, for a source with known statistics, a universal algorithm cannot compress data any better than an algorithm optimized to those source statistics. We already know that, for any ergodic source with entropy H∞ and any e > 0, it is possible to construct fixed to variable length codes (using complete knowledge of the source statistics) that use at most H∞+e binary digits per source letter; we also know that no codes of any type exist that use fewer than H∞ binary digits per source letter. What will be shown here is that LZ77, when applied to any ergodic source, uses at most H∞+e binary digits per source letter if the window w is sufficiently large. That is, these universal codes can reach the same limit of compressibility as codes using knowledge of the statistics. This is not too surprising, since the statistics can be measured and then used. What is somewhat surprising is that such a simple code is capable of achieving these asymptotic results. Unfortunately, the required window size depends on the statistics of the source, so that in practice, one chooses a fixed window size and hopes for the best.THE LZ77 ALGORITHM: Given a sequence of letters from a source of alphabet size K, and given a "window size" w which is a power of two: 1) Encode the first w letters without compression, using Èlog K˘ binary digits per letter. (this gets amortized over a very long string; all logs here are base 2) 2) Set pointer P = w (as the algorithm runs, P increases, always dividing the source sequence into its already encoded prefix x = x1, x2,...,xP and the yet to be encoded sequence x = xP+1, xP+2, ....) 1PP+13) Find the largest L such that x = x for some m, 1≤m≤w; set L=1 if no such match exists. P+1P+LP-m+1P-m+L Pa b a b a b c d L=4 m=2w = windowm....a c b d a b a c d P+1P-w Note that the largest match starts two letters inside the window (at m=2) and extends for two letters beyond the window (yielding a match of L=4). Having matches extend beyond the pointer sometimes allows longer matches (as in the present case), and also, as we see later, simplifies the analysis. Note also that there is another match, of length 3, starting one letter from the left end of the window at m=w-1. This is not used, since the longest match is selected. If there are two longest matches, either can be selected. Figure 1 4) Encode L into a unary-binary code, i.e., ÎlogL˚ zeros followed by the binary expansion of L: 1 2 3 4 5→→→→→1 010 011 00100 001016 7 8 9 10→→→→→00110 00111 0001000 0001001 0001010 (The length of the unary-binary code, for a given L is 2 ÎlogL˚+1; note that the unary-binary code is a fixed to variable length prefix condition coding of the positive integers; see [El75] for a treatment of such integer encodings).5) If Èlog K˘L > log w, encode m into log w binary digits; otherwise encode x without compression using Èlog K˘ bits per letter (i.e., Èlog K˘L bits overall). (One can use the ordinary binary encoding, 1Æ00....001, 2Æ00...010, ..., m-1Æ11...111, and then map m into 00...000). P+1P+L6) P := P+L; goto step 3 (go on to encode the next match). COMMENTS ON ALGORITHM: The algorithm is a variable length to variable length encoding. One can regard the dictionary (at any given time) as the set of substrings of the window, plus the single letters of the alphabet. Because of the single letters, the dictionary is valid. On the other hand, the prefix condition is not typically satisfied. The code words are of variable length both because of the unary-binary encoding of L and also because short matches (those with Èlog K˘L ≤ log w) are encoded without compression. In particular, the length n of a code word is a function of the length L of the match, and is given by n(L) = 2 ÎlogL˚+1 + min[logw, Èlog K˘L] (1) We shall frequently want simple upper bounds to n(L). The first comes from upper bounding 2 ÎlogL˚+1 by 3L/2 and upper bounding the min term by Èlog K˘L. This yields n(L) ≤ gL where g = 3/2 + Èlog K˘ (2) The second bound comes from upper bounding the min term by log w, yielding n(L) ≤ 2 logL + 1 + logw (3) Both bounds are valid for all L, but the first is useful for L small, and the second for L large. We next explain how a receiver of the encoded sequence can retrieve the source sequence. First the initial window is retrieved from the initial Èlog K˘w encoded binary digits. The next string of binary digits is the unary-binary encoding of the first match length. This is decoded, using the prefix condition of the encoding, and the receiver then determines if Èlog K˘L>logw. If so, the position m of the match is decoded, and xw+1w+L is retrieved by extracting xw-m+1w-m+L from the stored window; if not xw+1w+L is decoded directly.This then allows the window to be updated at the receiver, and further decoding proceeds in the same way. In the case of encoding a finite length string, say x, step 3 of the algorithm must be slightly modified to stop after the entire string is encoded. That is, step 3 must be modified to find the largest L ≤ N-P such that x = x for some m, 1≤m≤w. Also, step 6 must be modified to stop when P is incremented to N. 1NP+1P+L

View Full Document


School:
Email:
New Password:
Confirm Password:

MIT 6 441 - LEMPEL-ZIV SLIDING WINDOW UNIVERSAL COMPRESSION

Sign up for free to view:

Please select your school