Huffman Encoding
Jan 13, 2019

Entropy

- Entropy is a measure of information content: the number of bits actually required to store data.
- Entropy is sometimes called a measure of surprise.
- A highly predictable sequence contains little actual information.
  - Example: 11011011011011011011011011 (what's next?)
  - Example: I didn't win the lottery this week.
- A completely unpredictable sequence of n bits contains n bits of information.
  - Example: 01000001110110011010010000 (what's next?)
  - Example: I just won $10 million in the lottery!!!!
- Note that nothing says the information has to have any "meaning" (whatever that is).

Actual information content

- A partially predictable sequence of n bits carries less than n bits of information.
- Example #1: 111110101111111100101111101100
  - Blocks of 3: 111 110 101 111 111 100 101 111 101 100
- Example #2: 101111011111110111111011111100
  - Unequal probabilities: p(1) = 0.75, p(0) = 0.25
- Example #3: "We, the people, in order to form a..."
  - Unequal character probabilities: e and t are common, j and q are uncommon.
- Example #4: {we, the, people, in, order, to, ...}
  - Unequal word probabilities: the is very common.

Fixed and variable bit widths

- To encode English text, we need 26 lowercase letters, 26 uppercase letters, and a handful of punctuation marks.
- We can get by with 64 characters (6 bits) in all, so each character is 6 bits wide.
- We can do better, provided:
  - some characters are more frequent than others;
  - characters may have different bit widths, so that, for example, e uses only one or two bits while x uses several;
  - we have a way of decoding the bit stream, i.e., of telling where each character begins and ends.

Example Huffman encoding

- A = 0
- B = 100
- C = 1010
- D = 1011
- R = 11
- ABRACADABRA = 01001101010010110100110
- This is eleven letters in 23 bits.
- A fixed-width encoding would require 3 bits for five different letters, or 33 bits for 11 letters.
- Notice that the encoded bit string can be decoded!

Why it works

- In this example, A was the most common letter.
- In ABRACADABRA:
  - 5 A's, and the code for A is 1 bit long
  - 2 R's, and the code for R is 2 bits long
  - 2 B's, and the code for B is 3 bits long
  - 1 C, and the code for C is 4 bits long
  - 1 D, and the code for D is 4 bits long

Creating a Huffman encoding

- For each encoding unit (a letter, in this example), associate a frequency: the number of times it occurs. You can also use a percentage or a probability.
- Create a binary tree whose children are the encoding units with the smallest frequencies. The frequency of the root is the sum of the frequencies of the leaves.
- Repeat this procedure until all the encoding units are in the binary tree. (A runnable sketch of this procedure follows the worked example below.)

Example, step I

- Assume that the relative frequencies are A: 40, B: 20, C: 10, D: 10, R: 20. (I chose simpler numbers than the real frequencies.)
- The smallest numbers are 10 and 10 (C and D), so connect those.

Example, step II

- C and D have already been used, and the new node above them (call it C+D) has value 20.
- The smallest values are B, C+D, and R, all of which have value 20.
- Connect any two of these.

Example, step III

- The smallest value is now R's 20, while A and B+C+D both have value 40.
- Connect R to either of the others.

Example, step IV

- Connect the final two nodes.

Example, step V

- Assign 0 to left branches, 1 to right branches.
- Each encoding is a path from the root:
  - A = 0
  - B = 100
  - C = 1010
  - D = 1011
  - R = 11
- Each path terminates at a leaf.
- Do you see why encoded strings are decodable?
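The following is a minimal sketch of the procedure just described, using Python's heapq module to repeatedly pull out the two lowest-frequency nodes. The function name huffman_codes and the tuple-based tree representation are my own choices, not from the slides. Note that ties may be broken differently than in the hand-drawn example (here B is paired with R rather than with C+D), so the exact bit patterns can differ, but the total encoded length comes out the same (220 bits for these frequencies either way), which is the sense in which Huffman codes are optimal.

```python
import heapq

def huffman_codes(freqs):
    """Build a prefix code from a {symbol: frequency} mapping."""
    # Heap entries are (frequency, tiebreak, tree). A leaf is just a
    # symbol; an internal node is a (left, right) pair. The unique
    # tiebreak integer keeps heapq from ever comparing two trees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # smallest frequency
        f2, _, right = heapq.heappop(heap)   # second smallest
        # The new node's frequency is the sum of its children's.
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):          # internal node
            walk(node[0], path + "0")        # 0 for the left branch
            walk(node[1], path + "1")        # 1 for the right branch
        else:
            codes[node] = path or "0"        # single-symbol edge case
    walk(heap[0][2], "")
    return codes

freqs = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
print(huffman_codes(freqs))
# -> {'B': '00', 'R': '01', 'C': '100', 'D': '101', 'A': '11'}
# Different bits than the slides' tree, but an equally optimal code.
```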
Unique prefix property

- A = 0, B = 100, C = 1010, D = 1011, R = 11
- No bit string is a prefix of any other bit string.
- For example, if we added E = 01, then A (0) would be a prefix of E.
- Similarly, if we added F = 10, then it would be a prefix of three other encodings (B = 100, C = 1010, and D = 1011).
- The unique prefix property holds because, in a binary tree, a leaf is not on a path to any other node.

Practical considerations

- It is not practical to create a Huffman encoding for a single short string, such as ABRACADABRA.
  - To decode it, you would need the code table.
  - If you include the code table with the message, the whole thing is bigger than just the ASCII message.
- Huffman encoding is practical if:
  - the encoded string is large relative to the code table, OR
  - we agree on the code table beforehand. For example, it's easy to find a table of letter frequencies for English (or any other alphabet-based language).

About the example

- My example gave a nice, good-looking binary tree, with no lines crossing other lines.
- That's because I chose my example and numbers carefully.
- If you do this for real data, you can expect your drawing to be a lot messier; that's OK.

Data compression

- Huffman encoding is a simple example of data compression: representing data in fewer bits than it would otherwise need.
- A more sophisticated method is GIF (Graphics Interchange Format) compression, for .gif files.
- Another is JPEG (Joint Photographic Experts Group), for .jpg files.
  - Unlike the others, JPEG is lossy: it loses information.
  - This is generally OK for photographs (if you don't compress them too much), because decompression adds "fake" data very similar to the original.
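To tie the pieces together, here is a small sketch (again my own code, not from the slides) that encodes and decodes ABRACADABRA with the example code table. The decoder shows the unique prefix property in action: it can greedily emit a symbol the moment its buffered bits match a code, because no code is a prefix of any other. The helper names encode and decode are hypothetical.

```python
codes = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

def encode(text, codes):
    # Concatenate the code for each character.
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    reverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:        # prefix-free: first match is the symbol
            out.append(reverse[buf])
            buf = ""
    assert buf == "", "bit stream ended in the middle of a code"
    return "".join(out)

bits = encode("ABRACADABRA", codes)
print(bits, len(bits))            # 01001101010010110100110 23
print(decode(bits, codes))        # ABRACADABRA
# A fixed-width code would need 3 bits per letter: 33 bits, not 23.
```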
The End