UIUC ECE 190 - Text Compression with Huffman Coding, - D2168552

Home> Schools> University of Illinois at Urbana, Champaign> Electrical and Computer Engineering (ECE) > ECE 190> Text Compression with Huffman Coding,

DOC PREVIEW

UIUC ECE 190 - Text Compression with Huffman Coding,

School name University of Illinois at Urbana, Champaign

Course Ece 190- Intro to Computing Systems

Pages 7

This preview shows page 1-2 out of 7 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

ECE190 MP5  Text Compression with Huffman Coding, Spring 2010 ASCII coding is inefficient. Compare for instance the traps PUTS and PUTSP in the LC‐3 ISA. The PUTS trap requires a single 8‐bit ASCII character to be placed in a single 16‐bit memory location. However, the PUTSP trap requires two 8‐bit ASCII characters to be placed in one 16‐bit memory location on the LC‐3. For a finite amount of memory, for instance 100 locations, PUTSP can store twice as many characters as PUTS. This is the idea behind the following programming assignment, but we will take it farther and utilize a more efficient coding method than ASCII to store text. The coding method is called Huffman coding and is in fact as efficient as a coding scheme can be‐‐it is provably optimal! Consider the length 8 input string "aaaabbcd". How many different letters does this string contain? Four: 'a' occurs 4 times, 'b' occurs 2 times, and both 'c' and 'd' occursonce. The total string length is 8. Thus, 'a' occurs half the time (the most frequently), 'b' occurs 2/8 of the time, and both 'c' and 'd' occur 1/8 of the time (the least frequently). The compression idea we are after is to assign a unique binary string for each letter, where the most frequently occurring letter gets the shortest string so that it takes the least amount of memory to store, and the least frequently occurring letter gets the longest string. For this example, such an encoding would be: 'a' is "1", 'b' is "01", 'c' is "001", and 'd' is "000". The unique binary string for each character actually has a stronger property than uniqueness‐‐it is prefix‐free, which gives the ability to avoid confusing the binary strings of letters. This has a very elegant relationship with the finite‐state machines studied earlier in the course. Compare the binary strings for each letter to ASCII encoding for each letter. In ASCII (and neglecting any architecture‐specific details, such as the 16‐bit addressability considered above in the LC‐3 example) any string of length 8 takes (8 characters)*(8 bits per character)=64 bits. In Huffman code, we see that this takes only (4 'a's)*(1 bit per 'a')+(2 'b's)*(2 bits per 'b')+(1 'c')*(3 bits per 'c')+(1 'd')*(3 bits per 'd')=14 bits! Even in the worst case where the length 8 string is composed of 8 different letters, such as in "abcdefgh", the encoding would be 20 bits, still a large improvement over the 64 bits required for ASCII. Milestone 1  Huffman Tree Generation and Text Compression The task for the first milestone is given a text file of ASCII characters, take the file's contents (assumed to be less than some fixed length to avoid annoying edge cases, see the given code), generate a binary tree to determine an optimal encoding and encode the given string. Note that we will be skipping an important step‐‐the output of your program will still be in ASCII, just written as the codes. This saves a lot of tedious work. So for the example "aaaabbcd", the input and output of the program are: ./mp5.1 -c tests/test0.txt test.txt contents are: aaaabbcd (with no newline or that would also be encoded, see the given test files) where ‐c stands for "compress" and executing mp5 on the test0.txt file creates an output file tests/test0.txt.hc with contents: 11110101001000 (these are ASCII ones and zeros, so '1' and '0', with no newline at the end, see the given test files) The major steps of the Huffman coding algorithm are: 1. Read ASCII text from a specified input file into a buffer in memory. THIS MUST BE DONE USING fopen() and getc() (or fscanf, just some function that works with file streams). 2. Generate an initial forest of Huffman trees based on the contents of this buffer. A Huffman tree is a binary tree. A forest is a disjoint union of trees‐‐for our purposes, a forest is just a singly linked‐list. The data type for the Huffman tree and the forest is a singly‐linked list of binary trees: typedef struct htree_t { char letter; /* letter represented by this tree */ int weight; /* number of times letter appears */ struct htree_t* left; /* pointer to left subtree */ struct htree_t* right; /* pointer to right subtree */ struct htree_t* next; /* pointer to next tree in the forest */ } HTREE; where letter is the letter specified by the tree, weight is the number of times the letter appears in the ASCII text, left is the pointer to the left child, right is the pointer to the right child, and next is the pointer to the next Huffman tree in the linked‐list. Implement the function appendForest to add new trees to the forest, and implement the function searchForestLetter to determine if a letter already has a corresponding tree in the list of trees and you should increment its weight. You must use malloc() for the generation of new entries of this data structure as you will be appending, updating, and removing entries from the forest as a part of the algorithm. The initial forest of Huffman trees for the example of "aaaabbcd" is: 3. Run the Huffman coding algorithm on the initial forest. This is done with a call to the huffmanize function (hint: recall pass‐by‐value versus pass‐by‐reference): a. Find the tree (called smallestTree) in the forest with the lowest weight: ties are broken by taking the last occurrence of the lowest weight tree using the functions searchForestMin and searchForestMinHelp. In the example above, this would be the tree corresponding to the letter ‘d’. COMPARE TO THE GOLD FILE OUTPUT AS THERE ARE MANY POSSIBLE UNIQUE ENCODINGS, BUT YOU MUST BREAK TIES IN THIS WAY OR YOU WILL RECEIVE NO CREDIT BECAUSE YOUR OUTPUT WILL BE DIFFERENT THAN THE GOLD FILE. b. Remove this tree from the forest using the function cutForest. Hint: do not free() the memory, just change the pointers in the list. c. Repeat 3a and 3b to find and remove the next smallest weight tree, called smallTree. In the example above, this would be the tree corresponding to the letter ‘c’. d. Create a new tree called newTree with no letter, weight equal to the sum of the weights of smallestTree and smallTree, left equal to smallestTree, and right equal to smallTree. e. Add newTree to the forest using appendForest. f. Recursively repeat 3a to 3e until there is a single tree in the forest (that means until next = NULL for every tree in the list). The single tree in the forest is called the

View Full Document