MIT 6.441 - VARIABLE TO FIXED LENGTH ADAPTIVE SOURCE CODING


VARIABLE TO FIXED LENGTH ADAPTIVE SOURCE CODING - LEMPEL-ZIV CODING
6.441 Supplementary Notes 2a, 2/17/94

1) THE LEMPEL-ZIV ALGORITHM

An adaptive source coder, or universal encoder, is designed to compress data from any source. The Lempel-Ziv [LZ76, ZL77, ZL78] coding strategy is essentially a variable to fixed length code containing a parsing dictionary of source strings (as with the Tunstall codes), but this dictionary changes dynamically. The algorithm as described here and in [ZL78] is primarily appropriate for asymptotic analysis; more practical versions are discussed in [Sto88]. More recent Lempel-Ziv encoders use the strategy of [ZL77], which is a sliding window version as discussed in class.

Let u1, u2, ..., un be the source sequence to be encoded. Initially we view n as ∞, and later we can view n either as the total number of letters to be encoded or as a parameter, where the encoder encodes n characters and then starts over with the next n characters, etc.

The general idea is to build a Tunstall code adaptively, starting with a dictionary that contains just the single letters of the K letter source alphabet. The dictionary is then enlarged as the encoding proceeds. The rule for enlarging the dictionary is essentially an adaptive version of the rule used to construct the Tunstall code. That is, in constructing a Tunstall code, one successively takes the most probable word in the current dictionary and replaces it with all K single letter extensions of that word. In the Lempel-Ziv code, each time a word in the current dictionary is encoded, that word is replaced in the dictionary with all of its single letter extensions¹.

EXAMPLE 1: Consider the sequence a a a a b ... from the alphabet {a, b, c}. Fig. 1 illustrates how this is parsed into a | a a | a b | ... and how the dictionary tree grows as this parsing takes place.
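The parsing rule just described can be sketched in a few lines of Python (an illustrative sketch of our own, not from the notes; the function name lz_parse is ours). The dictionary is kept as the set of leaves of the tree; parsing a word removes it and inserts its K single letter extensions:

```python
# Illustrative sketch of the Lempel-Ziv parsing rule: start with the
# single letters of the alphabet, and each time a word is parsed,
# replace it in the dictionary with all K single letter extensions.

def lz_parse(sequence, alphabet):
    dictionary = set(alphabet)   # initial dictionary: the K single letters
    parses = []
    word = ""
    for letter in sequence:
        word += letter
        if word in dictionary:   # word is a leaf of the dictionary tree
            parses.append(word)
            dictionary.remove(word)                        # leaf becomes internal
            dictionary.update(word + a for a in alphabet)  # K extensions added
            word = ""
    return parses

print(lz_parse("aaaab", "abc"))  # -> ['a', 'aa', 'ab'], as in Fig. 1
```

Running the same sketch on an all-zeros input reproduces the parsing 0 | 0 0 | 0 0 0 | ... of Example 2 below.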
Each time the encoder parses another segment, it generates a binary code word for that segment. We postpone the question of how the mapping from dictionary entries to code words takes place until later.

[Figure 1: the original dictionary {a, b, c} and the dictionary trees after encoding a, then a a, then a b]

¹In [ZL78], and in most of the literature on this topic, the dictionary is regarded as the set of intermediate nodes above, and the encoding is viewed as sending a dictionary entry followed by a single character (i.e., in our terms, sending the leaf node of the tree). The distinction does not affect the algorithm, which is exactly the same in both cases, but allows us to bring out clearly the relationship between Tunstall codes and Lempel-Ziv codes.

EXAMPLE 2: Consider an unending sequence of 0's from the binary alphabet {0, 1}. This gets parsed into 0 | 0 0 | 0 0 0 | 0 0 0 0 | 0 0 0 0 0 | ... . The corresponding code tree is shown in Figure 2. Note that the number of source letters involved in the first c parsed strings is 1+2+3+...+c, or c(c+1)/2. Thus as the length n of the source sequence increases, the number of parsed strings grows roughly as √(2n) (see Figure 2). It is not difficult to imagine that this example yields the fastest possible increase in the size of the individual parsed strings with n and the slowest possible growth in the total number of parsed strings with n (see Exercise 1 at the end of these notes).

[Figure 2: the code tree corresponding to the parsing 0 | 0 0 | 0 0 0 | 0 0 0 0 | ...]

EXAMPLE 3: Consider a sequence made up of the concatenation of all binary strings, starting with strings of length 1, then strings of length 2, etc. Thus the sequence, using spaces to make the structure more apparent visually, is 0 1 00 01 10 11 000 001 010 ... . Note that the Lempel-Ziv algorithm parses this sequence according to the spaces above (i.e., into the binary strings being concatenated in the construction).
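The counting argument of Example 2 can be checked numerically (a small sketch of our own): if the first c parses consume c(c+1)/2 letters, then a sequence of length n = c(c+1)/2 is parsed into about √(2n) ≈ c strings.

```python
import math

# The first c parses of the all-zeros sequence consume 1+2+...+c
# = c(c+1)/2 letters, so n letters are parsed into about sqrt(2n) strings.
for c in (10, 100, 1000):
    n = c * (c + 1) // 2               # letters consumed by c parses
    print(c, round(math.sqrt(2 * n)))  # sqrt(2n) recovers c (approximately)
```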
The dictionary tree, after encoding the part of the sequence shown above, is illustrated in Figure 3. Note that the size of the parsed strings grows as log2 n, where n is the length of the input sequence already encoded, and thus the total number of parsed strings is on the order of n/log2 n. It is not difficult to imagine that this example yields the slowest possible increase in the size of the individual parsed strings with n and the fastest possible growth in the total number of parsed strings with n (see Lemma 1 in section 3).

[Figure 3: the dictionary tree after parsing 0 | 1 | 00 | 01 | 10 | 11 | 000 | 001 | 010 | ...]

Now consider the encoding of dictionary entries into binary strings. In practice, Lempel-Ziv encoding is usually done with a maximum dictionary size, with some rule to replace old dictionary entries with new entries. To understand the asymptotic operation of the algorithm, however, we want to view the dictionary as simply growing with time. After c-1 parses, the size of the dictionary is K+(c-1)(K-1), where K is the source alphabet size. Thus ⌈log2[K+(c-1)(K-1)]⌉ bits are required for a fixed length encoding of the cth parsed string. This means that all code words at a given time have the same length, but the code word length gradually increases with time (i.e., with successive parses). The particular mapping of dictionary entries into binary code words is simply a matter of implementation convenience, but must of course follow some given algorithm. Note that when the decoder attempts to decode the encoded message stream, it can duplicate the actions of the encoder. That is, for Example 3 above, the encoder might map the initial string 0 into the code word 0 (since the size of the initial dictionary is 2). The next string, 1, must be mapped into a code word of length two since the size of the dictionary is now 3. The decoder, on seeing the initial 0 in the encoded sequence, decodes the source letter 0 and also knows that the dictionary now contains {00, 01, 1}.
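The code word length formula can be evaluated directly (an illustrative snippet; the helper name codeword_length is ours). For the binary alphabet K = 2 of Example 3, the first code word has length 1 and the second length 2, exactly as in the text:

```python
import math

# Code word length for the c-th parsed string, per the formula above:
# the dictionary holds K + (c-1)(K-1) entries after c-1 parses.
def codeword_length(c, K):
    return math.ceil(math.log2(K + (c - 1) * (K - 1)))

# Binary source (K = 2), as in Example 3: lengths grow gradually
# with successive parses.
print([codeword_length(c, 2) for c in range(1, 8)])  # -> [1, 2, 2, 3, 3, 3, 3]
```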
The decoder also knows the algorithm for mapping dictionary entries into code words, and knows that the next code word will have length 2. Thus the decoder can decode the second and third encoded binary digits into the source string 1 and again enlarge the dictionary. In this way, the decoder always knows what the current dictionary is and what the current code word length is at the beginning of decoding each new code word.

In general, we can view the dictionary at any time as a tree. The terminal nodes are strings that have never been used up to the given time, and each intermediate node has occurred at least once in the parsing of the currently entered input sequence. In fact the
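The encoder/decoder symmetry described above can be sketched as a toy implementation (our own illustrative choices, not the notes' own mapping: dictionary entries are kept in a fixed list order, and a parsed word is encoded as its index written with ⌈log2(dictionary size)⌉ bits). Both sides perform the same dictionary replacement after each code word, so their trees and code word lengths stay in lockstep:

```python
import math

# Toy Lempel-Ziv encoder/decoder pair. The mapping from dictionary
# entries to code words (list order -> fixed-length index) is an
# arbitrary but agreed-upon choice, as the text requires.

def encode(sequence, alphabet):
    dictionary = list(alphabet)           # agreed initial dictionary
    bits = ""
    word = ""
    for letter in sequence:
        word += letter
        if word in dictionary:
            length = math.ceil(math.log2(len(dictionary)))
            i = dictionary.index(word)
            bits += format(i, f"0{length}b")
            dictionary[i:i+1] = [word + a for a in alphabet]  # K extensions
            word = ""
    return bits                           # any trailing partial word is dropped

def decode(bits, alphabet):
    dictionary = list(alphabet)           # same initial dictionary ...
    out = ""
    pos = 0
    while pos < len(bits):
        length = math.ceil(math.log2(len(dictionary)))  # ... same lengths
        i = int(bits[pos:pos + length], 2)
        word = dictionary[i]
        out += word
        dictionary[i:i+1] = [word + a for a in alphabet]
        pos += length
    return out

coded = encode("0100011011", "01")        # the sequence of Example 3
print(coded, decode(coded, "01"))
```

As in the text, the first code word here is the single bit 0 (initial dictionary size 2), the second has length 2 (size 3), and the decoder recovers the source with no side information.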

