Lempel-Ziv Notes
Prof. Peter Shor

We now explain the algorithm that Lempel and Ziv gave in a 1978 paper, generally called LZ78. This is as opposed to LZ77, an earlier algorithm which differs significantly in the implementation details, but is based on the same general idea. This idea is that if some text is not random, a substring that you see once is more likely to appear again than substrings you haven't seen. The LZ78 algorithm works by constructing a dictionary of substrings, which we will call phrases, that have appeared in the text. It constructs this dictionary on the fly, going through the data only once. This is a great advantage in that you don't have to receive the entire document before starting to encode it. On the other hand, it can be a problem if, for example, the first half of some document is in English and the second half is in Chinese: in this case, the dictionary constructed for the first half will be suboptimal when used on the second half.

There are many variations of Lempel-Ziv around, but they all follow the same basic idea. We'll just concentrate on LZ78 because it is one of the simplest to explain and analyze, although other variants may work somewhat better in practice.

The idea is to parse the sequence into distinct phrases. The version we analyze does this greedily. Suppose, for example, we have the string

AABABBBABAABABBBABBABB

We start with the shortest phrase on the left that we haven't seen before. This will always be a single letter, in this case A:

A,ABABBBABAABABBBABBABB

We now take the next phrase we haven't seen. We've already seen A, so we take AB:

A,AB,ABBBABAABABBBABBABB

The next phrase we haven't seen is ABB, as we've already seen AB. Continuing, we get B after that:

A,AB,ABB,B,ABAABABBBABBABB

and you can check that the rest of the string parses into

A,AB,ABB,B,ABA,ABAB,BB,ABBA,BB.    (1)

Because we've run out of letters, the last phrase on the end is a repeated one. That's OK.

Now, how do we encode this? For each phrase we see, we stick it in a dictionary. The next time we want to send it, we don't send the entire phrase, but just the number of this phrase. Consider the following table:

1    2    3    4    5    6     7    8     9
A    AB   ABB  B    ABA  ABAB  BB   ABBA  BB
0A   1B   2B   0B   2A   5B    4B   3A    7

The first row gives the numbers of the phrases, the second row gives the phrases, and the third row gives their encodings. That is, when we're encoding ABAB (the sixth phrase), we encode it as 5B. This maps to ABAB, since the fifth phrase was ABA, and we add B to it. Here the empty phrase should be considered as the 0th phrase, and is encoded by 0.

The last piece is encoding this string into binary. This gives

001110100101001011100101100111

To see how this works, I've now inserted dividers and commas to make it more comprehensible:

0,0|1,1|10,1|00,1|010,0|101,1|100,1|011,0|0111

We have taken the third row of the previous table and expressed all the numbers in binary (before the comma) and the letters in binary (after the comma). Note that I've mapped A to 0 and B to 1. If you had a larger alphabet, you would encode the letters by more than one bit. In fact, you could even use a Huffman code to encode the letters if you knew the frequencies of your letters. Note also that as soon as a reference to a phrase might conceivably involve $k$ bits (that is, starting with dictionary element number $2^{k-1}$), I've actually used $k$ bits, so the number of bits used before the comma keeps increasing. This ensures that the decoding algorithm knows where to put the commas and dividers.
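To make the encoding procedure concrete, here is a minimal Python sketch of the scheme just described: it parses the input greedily, numbers the phrases as they enter the dictionary, and writes each phrase as a phrase number (using just enough bits for the current dictionary size) followed by the bits for one letter. The function name lz78_encode and the default two-letter alphabet are illustrative choices, not anything specified in these notes.

```python
from math import ceil, log2

def lz78_encode(text, alphabet="AB"):
    """Greedy LZ78 parse of text, emitted as the bit string described above.
    A sketch for illustration, not a tuned implementation."""
    letter_bits = max(1, ceil(log2(len(alphabet))))
    letter_code = {ch: format(i, "0%db" % letter_bits)
                   for i, ch in enumerate(alphabet)}

    dictionary = {"": 0}                      # the empty phrase is the 0th phrase
    bits = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:         # keep extending a known phrase
            phrase += ch
            continue
        t = len(dictionary)                   # this new phrase gets number t
        ref_bits = max(1, ceil(log2(t)))      # enough bits for numbers 0..t-1
        bits.append(format(dictionary[phrase], "0%db" % ref_bits))
        bits.append(letter_code[ch])
        dictionary[phrase + ch] = t
        phrase = ""
    if phrase:                                # ran out of letters mid-phrase:
        ref_bits = max(1, ceil(log2(len(dictionary))))
        bits.append(format(dictionary[phrase], "0%db" % ref_bits))  # bare number
    return "".join(bits)
```

On the example string AABABBBABAABABBBABBABB this produces exactly the 30-bit string shown above.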
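For completeness, here is a matching decoder sketch, under the same illustrative assumptions. It relies on the dictionary-reconstruction argument spelled out in the next paragraph: the decoder always knows how many phrases are in the dictionary so far, so it knows how many bits to read for each phrase number, and it rebuilds the same dictionary the encoder built.

```python
from math import ceil, log2

def lz78_decode(bits, alphabet="AB"):
    """Inverse of lz78_encode: re-inserts the dividers and commas and
    rebuilds the encoder's dictionary phrase by phrase."""
    letter_bits = max(1, ceil(log2(len(alphabet))))
    phrases = [""]                            # phrase 0 is the empty phrase
    out = []
    pos = 0
    while pos < len(bits):
        t = len(phrases)                      # number of the phrase being read
        ref_bits = max(1, ceil(log2(t)))      # same width rule as the encoder
        ref = int(bits[pos:pos + ref_bits], 2)
        pos += ref_bits
        if pos >= len(bits):                  # final phrase was a bare number
            out.append(phrases[ref])
            break
        ch = alphabet[int(bits[pos:pos + letter_bits], 2)]
        pos += letter_bits
        phrases.append(phrases[ref] + ch)     # same dictionary as the encoder
        out.append(phrases[t])
    return "".join(out)
```

For the example, lz78_decode(lz78_encode("AABABBBABAABABBBABBABB")) returns the original string.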
To decode, the decoder needs to construct the same dictionary. To do this, he first takes the binary string he receives and inserts the dividers and commas. This is straightforward. The first two dividers each come after 2 bits, and the next two each come after 3 bits. We then get $2^2$ phrases of length 4 bits, $2^3$ of length 5 bits, $2^4$ of length 6 bits, and in general $2^k$ of length $k+2$ bits. This is because when we encode our phrases, if we have $r$ phrases in our dictionary, we use $\lceil \log_2 r \rceil$ bits to encode the number of the phrase, to ensure that the encodings of all phrases use the same number of bits. You can see that in the example above, the first time we encode AB (phrase 2) we encode it as 10, and the second time we encode it as 010. This is because the first time we had three phrases in our dictionary, and the second time we had five.

The decoder then uses the same algorithm to construct the dictionary as the encoder did. He knows phrases 1 through $r-1$ when he is trying to figure out what the $r$th phrase is, and this is exactly the information he needs to reconstruct the dictionary.

You might notice that in this case the compression algorithm actually makes the sequence longer. This is for one of two reasons: either the original sequence was too random to be compressed much, or it was too short for the asymptotic efficiency of Lempel-Ziv to start being noticeable.

How well have we encoded the string? Suppose we have broken it up into $c(n)$ phrases, where $n$ is the length of the string. Each phrase is broken up into a reference to a previous phrase and a letter of our alphabet. The reference to a previous phrase can be represented by at most $\lceil \log_2 c(n) \rceil$ bits, since there are $c(n)$ phrases, and the letter can be represented by at most $\lceil \log_2 \alpha \rceil$ bits, where $\alpha$ is the size of the alphabet (in the above example it is 2). We have thus used at most $c(n)(\lceil \log_2 c(n) \rceil + \lceil \log_2 \alpha \rceil)$ bits total in our encoding.

In practice, you don't want to use too much memory for your dictionary. Thus, most implementations of Lempel-Ziv-type algorithms have some maximum size for the dictionary. When it gets full, they drop a little-used word from the dictionary and replace it by the current word. This also helps the algorithm adapt to messages with changing characteristics. You need to use some deterministic algorithm for deciding which word to drop, so that both the sender and the receiver will drop the same word.

So how well does the Lempel-Ziv algorithm work? In this lecture, we will calculate how well it works in the worst case. It actually also works well both in the random case, where each letter of the message is chosen uniformly and independently from a probability distribution, as well as in the case where the letters of the message are chosen …