Text CompressionSlide 2Text Compression: ExamplesHuffman coding: go go gophersHuffman CodingBuilding a Huffman treeBuilding a treeSlide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24EncodingProperties of Huffman codingWriting code out to fileDecoding a messageSlide 29Slide 30Slide 31Slide 32Slide 33Slide 34Slide 35Slide 36Slide 37Slide 38Slide 39Slide 40Slide 41Slide 42Slide 43Slide 44Slide 45Slide 46Slide 47Slide 48DecodingHuffman Tree 2Slide 51Slide 52Slide 53Slide 54Slide 55Slide 56Slide 57Slide 58Other methodsCompSci 100E34.1Text CompressionInput: String SOutput: String SShorterS can be reconstructed from SCompSci 100E34.2Text Compression… He said to his friend, "If the British marchBy land or sea from the town to-night,Hang a lantern aloft in the belfry archOf the North Church tower as a signal light,--One if by land, and two if by sea;And I on the opposite shore will be,Ready to ride and spread the alarmThrough every Middlesex village and farm,For the country folk to be up and to arm." … Henry Wadsworth Longfellow (1807-1882)By land 1By sea 2CompSci 100E34.3Text Compression: ExamplesSymbolASCII FixedlengthVar.lengtha 01100001000 000b 01100010001 11c 01100011010 01d 01100100011 001e 01100101100 10“abcde” in the different formatsASCII: 01100001011000100110001101100100…Fixed: 000001010011100 Var: 000110100110000000011 11a b c d eadbc e0000 1111EncodingsASCII: 8 bits/characterUnicode: 16 bits/characterCompSci 100E34.4Huffman coding: go go gophersEncoding uses tree:0 left/1 rightHow many bits? 37!!Savings? Worth it? ASCII(7bits) 3 bits Huffmang 103 1100111 000 00o 111 1101111 001 01p 112 1110000 010 1100h 104 1101000 011 1101e 101 1100101 100 1110r 114 1110010 101 1111s 115 1110011 110 101sp. 32 1000000 111 1013s1*22p1h12e1r14g3o3632p1h12e1r14s1*27g3o3613CompSci 100E34.5Huffman CodingD.A Huffman in early 1950’sBefore compressing data, analyze the input streamRepresent data using variable length codesVariable length codes though Prefix codesEach letter is assigned a codewordCodeword is for a given letter is produced by traversing the Huffman treeProperty: No codeword produced is the prefix of anotherLetters appearing frequently have short codewords, while those that appear rarely have longer onesHuffman coding is optimal per-character coding methodCompSci 100E34.6Building a Huffman treeBegin with a forest of single-node trees (leaves)Each node/tree/leaf is weighted with character countNode stores two values: character and countThere are n nodes in forest, n is size of alphabet?Repeat until there is only one node left: root of treeRemove two minimally weighted trees from forestCreate new tree with minimal trees as children,oNew tree root's weight: sum of children (character ignored)Does this process terminate? How do we get minimal trees?Remove minimal trees, hmmm……CompSci 100E34.7Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1C1F1P2U2R2L2D2G3T3O3B3A4M4SCompSci 100E34.8Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1C1F1P2U2R2L2D2G3T3O3B3A4M4S22CompSci 100E34.9Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3T3O3B3A4M4S2233CompSci 100E34.10Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3T3O3B3A4M4S223344CompSci 100E34.11Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3T3O3B3A4M4S22334444CompSci 100E34.12Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3O3T3B3A4M4S233444455CompSci 100E34.13Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 344445566CompSci 100E34.14Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 34444556666CompSci 100E34.15Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 3444455666886CompSci 100E34.16Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5E5N1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 3445566688688CompSci 100E34.17Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 34455666886881010CompSci 100E34.18Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”11 6I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 3445668868810101111CompSci 100E34.19Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”116I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 3445688688101011111212CompSci 100E34.20Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”116I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 344568681610101111121216CompSci 100E34.21Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”116I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 344568681610211112121621CompSci 100E34.22Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”116I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 34456868161021111223162123CompSci 100E34.23Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”116I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 34456868161021111223372337CompSci 100E34.24Building a tree“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”116I5N5E1F1C1P2U2R2L2D2G3O3T3B3A4M4S2 34456868161021111223376060CompSci 100E34.25Encoding1. Count occurrence of all occurring character O( )2. Build priority queue O( )3. Build Huffman tree O( )4. Create Table of codes from tree O( )5. Write Huffman tree and coded data to file O( )NAA log AA log ANCompSci 100E34.26Properties of Huffman codingWant to minimize weighted path length L(T)of tree Twi is the weight or count of each codeword idi is the leaf corresponding to codeword iHow do we calculate character (codeword) frequencies?Huffman coding creates pretty full bushy trees?When would it produce a “bad” tree?How do we produce coded compressed data from input
View Full Document