UT CS 337 - Compression - Huffman’s Algorithm


Compression: Huffman’s Algorithm
Greg Plaxton
Theory in Programming Practice, Fall 2005
Department of Computer Science, University of Texas at Austin

Tree Representation of a Prefix Code
• Consider the infinite binary tree T in which each left-going edge is labeled with a 0 and each right-going edge is labeled with a 1
• We define the label of any node x in T as the concatenation of the edge labels on the unique downward path from the root to x
– There is a one-to-one correspondence between the nodes of T and {0, 1}*
• A prefix code may be represented as the (unique, finite) subtree of T that contains the root and has leaf labels equal to the set of codewords
– Such a representation is possible because no two codewords lie on the same downward path from the root

Convention Concerning Zero-Frequency Symbols
• Let A denote the input alphabet
• For each a in A, let f(a) denote the frequency of a
– Remark: Scaling the frequencies by any positive quantity does not change the problem of computing an optimal code
– We will adopt the convention that a symbol with frequency zero should not be assigned a codeword in an optimal code
– For the remainder of this lecture, assume that A has been preprocessed to remove any zero-frequency symbols, so f(a) > 0 for all a in A

Full Binary Tree
• A (finite) binary tree is full if every internal node has exactly two children
• Theorem: Any optimal prefix code corresponds to a full binary tree
• Proof: The weight of a prefix code corresponding to a nonfull binary tree strictly decreases when we contract the lone descendant edge of any single-child node
• Theorem: Any optimal prefix code tightly satisfies the McMillan inequality (i.e., the inequality is an equality)
• Hint: Use induction on the number of leaves; for the induction step, remove two sibling leaves at the deepest level

Huffman’s Algorithm
• The algorithm manipulates a forest of full binary trees
– Invariant: The total number of leaves is n = |A|, and each leaf is labeled by a distinct symbol in A
– Initially, there are n single-node trees
– The algorithm repeatedly merges pairs of trees
– The algorithm terminates when there is one tree, the output tree
• To merge trees T1 and T2, we introduce a new node with left subtree T1 and right subtree T2
– Remark: It doesn’t matter which tree becomes the left or right child of the merged tree
• It remains only to specify which pair of trees to merge in each iteration

Huffman’s Algorithm: Merging Rule
• Let us define the frequency of a tree T in the forest as the sum of the frequencies of the symbols associated with (the leaves of) T
• We merge the two lowest-frequency trees, breaking ties arbitrarily
• Remark: Due to the nondeterminism associated with tie-breaking, Huffman’s algorithm can produce different codes; in fact, it can produce codes with different distributions of codeword lengths

Huffman’s Algorithm: Recursive Formulation
• Let a and b be the two lowest-frequency symbols in A, breaking ties arbitrarily
• Let A′ = (A \ {a, b}) ∪ {c}, where c is a new symbol with f(c) = f(a) + f(b)
• Recursively compute the tree representation of an optimal prefix code for A′
• Modify this tree by adding two children labeled a and b to the leaf labeled c (which ceases to be a leaf)
– The resulting tree corresponds to an optimal prefix code for A

Huffman’s Algorithm: Proof of Correctness
• It is not hard to convince yourself that the iterative and recursive versions of Huffman’s algorithm are equivalent
• Our proof is given in terms of the recursive version
• Let X denote the set of tree representations of all prefix codes for A in which the symbols a and b are associated with sibling leaves
– Lemma: The set X contains the tree representation of an optimal prefix code
• Let Y denote the set of tree representations of all prefix codes for A′
– Observation: There is a natural one-to-one correspondence between the sets X and Y; for any tree T in Y, let h(T) denote the corresponding tree in X
– Lemma: For any tree T in Y, the weight of h(T) is equal to the weight of T plus f(a) + f(b)
– It follows that any minimum-weight tree T in Y corresponds to a minimum-weight tree h(T) in X

Remarks on Optimal Codes
• While there is always an optimal prefix code, not every optimal code is a prefix code
– Example involving the alphabet {a, b, c} with equal frequencies?
• Two optimal prefix codes (for the same input) can have a different distribution of codeword lengths
– Example involving the alphabet {a, b, c, d}?

Implementation of Huffman’s Algorithm
• Use a priority queue data structure (e.g., a binary heap) that supports the operations insert and delete-min in O(log n) time
• Initially, we insert each of the n symbols into the priority queue
• At each iteration (or level of recursion) we remove two entries from the priority queue and insert one
• The total cost of the 2n − 1 insert (resp., delete-min) operations is O(n log n)
• The total cost of all other operations is O(1) per iteration (or level of recursion), hence O(n) overall

Linear-Time Implementation
• Assume that the “leaves” (i.e., symbols) are provided to us in a queue sorted in nondecreasing order of frequency
• As the algorithm progresses, we can find the lowest-frequency unprocessed leaf in O(1) time
• But what about the “nonleaf” nodes, one of which is created at each iteration?
– When a nonleaf is created, append it to a second queue of nonleaves (initially empty)
– Invariant: The nonleaf queue is sorted in nondecreasing order of frequency
– To implement delete-min, we simply compare the frequencies of the nodes at the heads of the leaf and nonleaf queues, dequeueing whichever is lower

Optimal Prefix Codes: Summary
• For an i.i.d. source, the average number of bits used to encode each symbol is within one of the entropy lower bound
– Except in cases where the entropy is extremely close to zero,
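The heap-based implementation sketched above can be written out as follows. This is an illustrative sketch, not the lecture's own code; the function name `huffman_code` and the use of a tuple-of-tuples tree representation are assumptions, and an integer tie-breaker is added so heap comparisons stay well defined when frequencies are equal.

```python
import heapq

def huffman_code(freq):
    """Build an optimal prefix code from a symbol -> frequency map,
    using a binary heap as in the O(n log n) implementation above."""
    # Each heap entry is (frequency, tie_breaker, tree); the tie_breaker
    # prevents Python from ever comparing two trees directly.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:
        # Degenerate one-symbol alphabet: assign the codeword "0".
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two lowest-frequency trees into a new internal node.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    # Walk the final tree, labeling left edges 0 and right edges 1,
    # so each leaf's label is its codeword.
    code = {}
    def walk(tree, label):
        if isinstance(tree, tuple):  # internal node: (left, right)
            walk(tree[0], label + "0")
            walk(tree[1], label + "1")
        else:                        # leaf: a symbol
            code[tree] = label
    walk(heap[0][2], "")
    return code
```

Because ties are broken by insertion order here, this realizes one particular run of the nondeterministic algorithm; other tie-breaking rules can yield different (equally optimal) codes.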

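The linear-time two-queue variant can be sketched as below, assuming the input arrives as (symbol, frequency) pairs already sorted in nondecreasing order of frequency. The names `huffman_two_queues` and `pop_min` are illustrative, not from the lecture.

```python
from collections import deque

def huffman_two_queues(sorted_symbols):
    """Linear-time Huffman tree construction, given (symbol, frequency)
    pairs sorted in nondecreasing order of frequency."""
    leaves = deque(sorted_symbols)   # unprocessed leaves, sorted by frequency
    nonleaves = deque()              # internal nodes, created in merge order

    def pop_min():
        # delete-min: compare the heads of the two queues and
        # dequeue whichever has the lower frequency.
        if not nonleaves or (leaves and leaves[0][1] <= nonleaves[0][1]):
            return leaves.popleft()
        return nonleaves.popleft()

    if len(leaves) == 1:
        return leaves[0][0]          # one-symbol alphabet: tree is a leaf
    while len(leaves) + len(nonleaves) > 1:
        t1, f1 = pop_min()
        t2, f2 = pop_min()
        # Merged frequencies never decrease over the run, so appending
        # preserves the sortedness invariant of the nonleaf queue.
        nonleaves.append(((t1, t2), f1 + f2))
    return nonleaves[0][0]
```

The key design point is the invariant from the slide: since each merge produces a frequency at least as large as every previously merged frequency, plain appends keep the nonleaf queue sorted, so every delete-min is an O(1) head comparison.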

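The entropy bound in the summary can be checked numerically. A small sketch follows; the frequencies are the same illustrative ones used above, and the hard-coded codeword table is one optimal prefix code for them (an assumption for the example, not taken from the lecture).

```python
import math

def entropy(freq):
    """Entropy in bits per symbol of the distribution induced by freq."""
    total = sum(freq.values())
    return -sum(f / total * math.log2(f / total) for f in freq.values())

def average_length(code, freq):
    """Average codeword length in bits per symbol under freq."""
    total = sum(freq.values())
    return sum(freq[s] * len(w) for s, w in code.items()) / total

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
# One optimal prefix code for these frequencies (lengths 1,3,3,3,4,4).
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
h = entropy(freq)
avg = average_length(code, freq)
# For an i.i.d. source: entropy <= average length < entropy + 1.
assert h <= avg < h + 1
```

Here the entropy is about 2.22 bits and the average codeword length is 2.24 bits, consistent with the within-one-bit guarantee stated in the summary.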