Lecture 18: Compression
CS105: Great Insights in Computer Science
Michael L. Littman, Fall 2006

Overview
• When we decide how to represent something in bits, there are some competing interests:
• easily manipulated/processed
• short
• Common to use two representations:
• one direct, to allow for easy processing
• one terse (compressed), to save storage and communication costs

Plan
• I’m going to try to describe one neat idea, implicit in Chapter 6: Huffman coding.
• For more information, see Wikipedia:
• http://en.wikipedia.org/wiki/Huffman_coding

Gettysburg Address
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced.
It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.

Character Counts
• For simplicity, let’s turn the uppercase letters into lowercase letters. That leaves us with:

282 <s>    4 <b>    22 ,    15 -
 10 .      0 ?     102 a    14 b
 31 c     58 d     165 e    27 f
 28 g     80 h      68 i     0 j
  3 k     42 l      13 m    77 n
 93 o     15 p       1 q    79 r
 44 s    126 t      21 u    24 v
 28 w      0 x      10 y     0 z

Attempt #1: ASCII
• The standard format for representing characters uses 8 bits per character.
• The address is 1482 characters long, so a total of 11856 bits is needed using this representation.
• 8 bits per character
• 11856 total bits
• 100% the size of the ASCII representation.

Attempt #2: Compact
• Note that, at least in its lowercase form, only 32 different characters are needed.
• Therefore, each can be assigned a 5-bit code (32 different 5-bit patterns).
• 5 bits per character
• 7410 total bits
• 62.5% the size of the ASCII representation.

5-bit Patterns
00000 <s>   00001 <b>   00010 ,    00011 -
00100 .     00101 ?     00110 a    00111 b
01000 c     01001 d     01010 e    01011 f
01100 g     01101 h     01110 i    01111 j
10000 k     10001 l     10010 m    10011 n
10100 o     10101 p     10110 q    10111 r
11000 s     11001 t     11010 u    11011 v
11100 w     11101 x     11110 y    11111 z

Attempt #3: Vary Length
• Some characters are much more common than others.
• Give the 4 most common characters a 3-bit code, and the remaining 28 a 6-bit code.
• How many bits do we need now?

Variable Length Patterns
000 <s>     001 e       010 t      011 a
100000 o    100001 h    100010 r   100011 n
100100 i    100101 d    100110 s   100111 l
101000 c    101001 w    101010 g   101011 f
101100 v    101101 ,    101110 u   101111 -
110000 p    110001 b    110010 m   110011 .
110100 y    110101 <b>  110110 k   110111 q
111000 ?    111001 j    111010 x   111011 z

Decodability
• Note that the code
was chosen so that the first bit of each character tells you whether the code is short (0) or long (1).
• This choice ensures that a message can actually be decoded:
• 100001100100000010100001001100010001110011
• h i <s> t h e r e .
• 42 bits, not 45. But, it is harder to work with.

What Gives?
• We had assigned all 32 characters 5-bit codes.
• Now we’ve got 4 that have 3-bit codes and 28 that have 6-bit codes. So, more than half of the characters have actually gotten longer.
• How can that change help?
• We need to factor in how many of each character there are.

Adding Up the Bits
• How many bits to write down just the letter “y”? Well, there are 10 “y”s and each takes 6 bits. So, 60 bits. (It was 50 before.)
• How about “t”? There are 126 and each takes 3 bits. That’s 378 (was 630).
• So, how do we total them all up?
• Let c be a character, freq(c) the number of times it appears, and len(c) its encoding length.
• Total bits = Σc freq(c) × len(c)

Summing It Up
• 282×3 + 165×3 + 126×3 + 102×3 + 93×6 + 80×6 + 79×6 + ... + 0×6 + 0×6 = 6867

282 <s>   165 e    126 t    102 a
 93 o      80 h     79 r     77 n
 68 i      58 d     44 s     42 l
 31 c      28 w     28 g     27 f
 24 v      22 ,     21 u     15 -
 15 p      14 b     13 m     10 .
 10 y       4 <b>    3 k      1 q
  0 ?       0 j      0 x      0 z

Attempt #3: Summary
• Total for this example:
• 4.6 bits per character (1482 characters)
• 6867 total bits
• 57.9% the size of the ASCII representation.

Attempt #4: Sorted
0 <s>
10 e
110 t
1110 a
11110 o
...
• Total for this example:
• 7.1 bits per character
• 10467 total bits
• 88.3% the size of the ASCII representation.

Attempt #5: Your Turn
• Make sure it is decodable! (The frequency counts are the same as in “Summing It Up” above.)

Can We Do Better?
• Shannon invented information theory, which talks about bits and randomness and encodings.
• Fano and Shannon worked together on finding minimal-size codes.
They found a good heuristic, but didn’t solve it.
• Fano assigned the problem to his class.
• Huffman solved it, not knowing his professor had unsuccessfully struggled with it.

Tree (Prefix) Code
• First, notice that a code can be drawn as a tree.
• Left = “0”, right = “1”. So, e = “001”, w = “101001”.
• The tree structure ensures the code is decodable: the bits tell you unambiguously which character you have.
[Figure: the code tree, with one leaf for each of the 32 characters: <s>, e, t, a, o, h, r, n, i, d, s, l, c, w, g, f, v, comma, u, -, p, b, m, period, y, <b>, k, q, ?, j, x, z]

Huffman Coding
• Make each character a subtree (“block”) with count equal to its frequency.
• Take the two blocks with the smallest counts and “merge” them into left and right branches. The count for the new block is the sum of the counts of the blocks it is made out of.
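The totals claimed in the slides can be recomputed directly from the formula Total bits = Σc freq(c) × len(c). A short Python sketch (not from the lecture; the frequency table is transcribed from the slides) that checks the Attempt #3 numbers:

```python
# Character frequencies from the slides (<s> = space, <b> = line break).
freq = {
    "<s>": 282, "e": 165, "t": 126, "a": 102, "o": 93, "h": 80, "r": 79,
    "n": 77, "i": 68, "d": 58, "s": 44, "l": 42, "c": 31, "w": 28, "g": 28,
    "f": 27, "v": 24, ",": 22, "u": 21, "-": 15, "p": 15, "b": 14, "m": 13,
    ".": 10, "y": 10, "<b>": 4, "k": 3, "q": 1, "?": 0, "j": 0, "x": 0, "z": 0,
}

# Attempt #3: the 4 most common characters get 3-bit codes, the other 28 get 6-bit codes.
length = {c: 3 if c in ("<s>", "e", "t", "a") else 6 for c in freq}

total = sum(freq[c] * length[c] for c in freq)  # Total bits = sum of freq(c) * len(c)
chars = sum(freq.values())

print(chars)                                 # 1482 characters
print(total)                                 # 6867 total bits
print(round(total / chars, 1))               # 4.6 bits per character
print(round(100 * total / (8 * chars), 1))   # 57.9 (% the size of 8-bit ASCII)
```

The same two-line computation reproduces the other attempts as well: with length fixed at 5 for every character it gives the 7410 bits of Attempt #2.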
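The merge procedure in the bullets above maps naturally onto a priority queue. Below is a minimal sketch (not from the lecture; it uses Python’s standard heapq, and the helper names huffman_code and decode are my own) that builds a Huffman code for the slides’ frequency table, checks the prefix property, round-trips a message, and confirms the result is no worse than Attempt #3’s 6867 bits:

```python
import heapq

# Character frequencies from the slides (<s> = space, <b> = line break).
freq = {
    "<s>": 282, "e": 165, "t": 126, "a": 102, "o": 93, "h": 80, "r": 79,
    "n": 77, "i": 68, "d": 58, "s": 44, "l": 42, "c": 31, "w": 28, "g": 28,
    "f": 27, "v": 24, ",": 22, "u": 21, "-": 15, "p": 15, "b": 14, "m": 13,
    ".": 10, "y": 10, "<b>": 4, "k": 3, "q": 1, "?": 0, "j": 0, "x": 0, "z": 0,
}

def huffman_code(freq):
    """Repeatedly merge the two smallest-count blocks, as the slide describes.
    Each block is a dict mapping characters to their code-so-far; merging
    prepends "0" to one side's codes and "1" to the other's."""
    heap = [(count, i, {ch: ""}) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # keeps heap entries comparable when counts tie
    while len(heap) > 1:
        c0, _, zero = heapq.heappop(heap)
        c1, _, one = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in zero.items()}
        merged.update({ch: "1" + code for ch, code in one.items()})
        heapq.heappush(heap, (c0 + c1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def decode(bits, code):
    """Read bits left to right; emit a character whenever a codeword matches."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

code = huffman_code(freq)

# Prefix property: no codeword is a prefix of another, so decoding is unambiguous.
words = sorted(code.values())
assert all(not b.startswith(a) for a, b in zip(words, words[1:]))

# Round-trip the slides' "hi there." example under the new code.
message = ["h", "i", "<s>", "t", "h", "e", "r", "e", "."]
bits = "".join(code[ch] for ch in message)
assert decode(bits, code) == message

total = sum(freq[ch] * len(code[ch]) for ch in freq)
print(total)  # no worse than the 6867 bits of Attempt #3
```

Because the two lowest-count blocks are merged first, rare characters (k, q, and the zero-count letters) end up deepest in the tree with the longest codewords, which is exactly the intuition behind Attempt #3.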
Rutgers University CS 105 - Lecture 18: Compression