Lecture-compress

Home> Academic Documents> Lecture-compress

DOC PREVIEW

Lecture-compress

This preview shows page 1-2 out of 7 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

115-853 Page115-583:Algorithms in the Real WorldData Compression IIILempel-Ziv algorithmsBurrows-WheelerIntroduction to Lossy Compression15-853 Page2Lempel-Ziv AlgorithmsLZ77 (Sliding Window)• Variants: LZSS (Lempel-Ziv-Storer-Szymanski)• Applications: gzip, Squeeze, LHA, PKZIP, ZOOLZ78 (Dictionary Based)• Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress)• Applications: compress, GIF, CCITT (modems), ARC, PAKTraditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.15-853 Page3LZ77: Sliding Window Lempel-ZivDictionary and buffer “windows” are fixed length and slide with the cursorOn each step:• Output (p,l,c)p = relative position of the longest match in the dictionaryl = length of longest matchc = next char in buffer beyond longest match• Advance window by l + 1a a c a a c a b c a b a b a cDictionary(previously coded)LookaheadBufferCursor15-853 Page4LZ77: Examplea a c a a c a b c a b a a a c (0,0,a)a a c a a c a b c a b a a a c (1,1,c)a a c a a c a b c a b a a a c (3,4,b)a a c a a c a b c a b a a a c (3,3,a)a a c a a c a b c a b a a a c (1,2,c)Dictionary (size = 6)Longest match Next character215-853 Page5LZ77 DecodingDecoder keeps same dictionary window as encoder.• For each message it looks it up in the dictionary and inserts a copyWhat if l > p? (only part of the message is in the dictionary.)• E.g. dict = abcd, codeword = (2,9,e)• Simply copy starting at the cursorfor (i = 0; i < length; i++)out[cursor+i] = out[cursor-offset+i]•Out = abcdcdcdcdcdc15-853 Page6LZ77 Optimizations used by gzipLZSS: Output one of the following formats(0, position, length) or (1,char)Typically use the second format if length < 3.a a c a a c a b c a b a a a c (1,a)a a c a a c a b c a b a a a c (1,a)a a c a a c a b c a b a a a c (0,3,4)a a c a a c a b c a b a a a c (1,c)15-853 Page7Optimizations used by gzip (cont.)• Huffman code the positions, lengths and chars• Non greedy: possibly use shorter match so that next match is better• Use hash table to store dictionary. – Hash is based on strings of length 3. – Find the longest match within the correct hash bucket. – Limit on length of search.– Store within bucket in order of position15-853 Page8Theory behind LZ77Sliding Window LZ is Asymptotically Optimal [Wyner-Ziv,94]Will compress long enough strings to the source entropy as the window size goes to infinity.Uses logarithmic code for position.Problem: “long enough” is really really long.∑∈=nAXnXpXpH)(1log)(nnHH∞→= lim315-853 Page9LZ78: Dictionary Lempel-ZivBasic algorithm:• Keep dictionary of words with integer id for each entry (e.g. keep it as a trie).• Coding loop– find the longest match S in the dictionary– Output the entry id of the match and the next character past the match from the input (id,c)– Add the string Sc to the dictionary• Decoding keeps same dictionary and looks up ids15-853 Page10LZ78: Coding Examplea a b a a c a b c a b c b (0,a)1 = aDict.Outputa a b a a c a b c a b c b (1,b)2 = aba a b a a c a b c a b c b (1,a)3 = aaa a b a a c a b c a b c b (0,c)4 = ca a b a a c a b c a b c b (2,c)5 = abca a b a a c a b c a b c b (5,b)6 = abcb15-853 Page11LZ78: Decoding Examplea(0,a)1 = aa a b(1,b)2 = aba a b a a(1,a)3 = aaa a b a a c(0,c)4 = ca a b a a c a b c(2,c)5 = abca a b a a c a b c a b c b(5,b)6 = abcbInputDict.15-853 Page12LZW (Lempel-Ziv-Welch) [Welch84]Don’t send extra character c, but still add Sc to the dictionary.The dictionary is initialized with byte values being the first 256 entries (e.g. a = 112, ascii), otherwise there is no way to start it up.The decoder is one step behind the coder since it does not know c• Only an issue for strings of the form SSc where S[0] = c, and these are handled specially415-853 Page13LZW: Encoding Examplea a b a a c a b c a b c b 112256=aaDict.Outputa a b a a c a b c a b c b257=aba a b a a c a b c a b c b 113258=baa a b a a c a b c a b c b 256259=aaca a b a a c a b c a b c b 114260=caa a b a a c a b c a b c b 257261=abc112a a b a a c a b c a b c b 260262=cab15-853 Page14LZ78 and LZW issuesWhat happens when the dictionary gets too large?• Throw the dictionary away when it reaches a certain size (used in GIF)• Throw the dictionary away when it is no longer effective at compressing (used in unix compress)• Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)15-853 Page15LZW: Decoding Examplea a b a a c a b c a b c b112256=aaa a b a a c a b c a b c b257=aba a b a a c a b c a b c b113258=baa a b a a c a b c a b c b256259=aaca a b a a c a b c a b c b114260=caa a b a a c a b c a b c b257261=abc112a a b a a c a b c a b c b260Input Dict15-853 Page16Lempel-Ziv Algorithms Summary Both LZ77 and LZ78 and their variants keep a “dictionary” of recent strings that have been seen.The differences are:• How the dictionary is stored• How it is extended• How it is indexed• How elements are removed515-853 Page17Lempel-Ziv Algorithms Summary (II)Adapt well to changes in the file (e.g. a Tar file with many file types within it).Initial algorithms did not use probability coding and perform very poorly in terms of compression (e.g. 4.5 bits/char for English)More modern versions (e.g. gzip) do use probability coding as “second pass” and compress much better.Are becoming outdated, but ideas are used in many of the newer algorithms.15-853 Page18Burrows -WheelerCurrently best “balanced” algorithm for textBasic Idea:• Sort the characters by their full context (typically done in blocks). This is called the block sorting transform.• Use move-to-front encoding to encode the sorted characters.The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.15-853 Page19Burrows Wheeler: ExampleLet’s encode: d1e2c3o4d5e6We’ve numbered the characters to distinguish them.Context “wraps” around.Context Char ecode6 d1 coded1 e2 odede2 c3 dedec3 o4 edeco4 d5 decod5 e6 Context Output dedec3 o4 coded1 e2 decod5 e6 odede2 c3 ecode6 d1 ⇐ edeco4 d5 SortContext15-853 Page20Burrows-Wheeler (Continued)Theorem: After sorting, equal valued characters appear in the same order in the output as in the most significant position of the context. Proof: Since the chars have equal value in the most-sig-position of the context, they will be ordered by the rest of …


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 7 pages.

Please select your school