Algorithms in Java, 4th Edition·Robert Sedgewick and Kevin Wayne·Copyright © 2008·April 15, 2008 11:25:10 PMData CompressionReferences: Algorithms 2nd edition, Chapter 22 http://www.cs.princeton.edu/algs4/65compression‣fixed-length codes‣variable-length codes‣an application‣adaptive codes2Data compressionCompression reduces the size of a file:•To save space when storing it.•To save time when transmitting it.•Most files have lots of redundancy.Who needs compression?•Moore's law: # transistors on a chip doubles every 18-24 months.•Parkinson's law: data expands to fill space available. •Text, images, sound, video, …Basic concepts ancient (1950s), best technology recently developed.“ All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value. ” — Carl Sagan3ApplicationsGeneric file compression.•Files: GZIP, BZIP, BOA.•Archivers: PKZIP.•File systems: NTFS.Multimedia.•Images: GIF, JPEG. •Sound: MP3.•Video: MPEG, DivX™, HDTV.Communication.•ITU-T T4 Group 3 Fax.•V.42bis modem.Databases. Google.4Encoding and decodingMessage. Binary data M we want to compress.Encode. Generate a "compressed" representation C(M).Decode. Reconstruct original message or some approximation M'.Compression ratio. Bits in C(M) / bits in M.Lossless. M = M', 50-75% or lower.Ex. Natural language, source code, executables.Lossy. M ≈ M', 10% or lower.Ex. Images, sound, video.EncoderMDecoderC(M) M'uses fewer bits (you hope)this lecture5Food for thoughtData compression has been omnipresent since antiquity:•Number systems. •Natural languages. •Mathematical notation.has played a central role in communications technology,•Braille.•Morse code.•Telephone system. and is part of modern life.•MP3.•MPEG.Q. What role will it play in the future?6What data can be compressed?US Patent 5,533,051 on "Methods for Data Compression", which is capable of compression all files.Slashdot reports of the Zero Space Tuner™ and BinaryAccelerator™.“ ZeoSync has announced a breakthrough in data compression that allows for 100:1 lossless compression of random data. If this is true, our bandwidth problems just got a lot smaller.… ” 7Perpetual motion machinesUniversal data compression is the analog of perpetual motion.Closed-cycle mill by Robert Fludd, 1618Gravity engine by Bob SchadewaldReference: Museum of Unworkable Devices by Donald E. Simanekhttp://www.lhup.edu/~dsimanek/museum/unwork.htm8What data can be compressed?Proposition. Impossible to losslessly compress all files.Pf 1.•consider all 1,000 bit messages.•21000 possible messages.•only 2999 + 2998 + … + 1 can be encoded with ≤ 999 bits.•only 1 in 2499 can be encoded with ≤ 500 bits!Pf 2 (by contradiction).•given a file M, compress it to get a smaller file M1.•compress that file to get a still smaller file M2.•continue until reaching file size 0.•implication: all files can be compressed with 0 bits!9Rdenudcany in Enlgsih lnagugaeQ. How much redundancy is in the English language?A. Quite a bit.“ ... randomising letters in the middle of words [has] little or no effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase the thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang. ” — Graham Rawlinson10‣fixed-length codes‣variable-length codes‣an application‣adaptive codes11Fixed-length coding•Use same number of bits for each symbol.•k-bit code supports 2k different symbols.Ex. 7-bit ASCII code.chardecimalcodeNUL00000000......a971100001b981100010c991100011d1001100100......DEL1271111111abracadabra!11000011100010111001011000011100011110000111001001100001110001011100101100001111111112 symbols × 7 bits per symbol = 84 bits in code12Fixed-length coding•Use same number of bits for each symbol.•k-bit code supports 2k different symbols.Ex. 3-bit custom code.charcodea000b001c010d011r100!111abracadabra!00000110000001000001100000110000011112 symbols × 3 bits per symbol = 36 bits in code13Fixed-length coding: general scheme•Count number of different symbols.•~ lg M bits suffice to support M different symbols.Ex. Genomic sequences.•4 different nucleotides.•2 bits suffice.•Amazing but true: initial databases in 1990s did not use such a code!Important detail. Decoder needs to know the code!charcodea00c01t10g11actacagatgaa000110000100110010110000~ 2N bits to encode genome with N nucleotides 2-bit DNA code14‣fixed-length codes‣variable-length codes‣an application‣adaptive codesUse different number of bits to encode different symbols.Ex. Morse code.Issue. Ambiguity. • • • − − − • • •SOS ?IAMIE ?EEWNI ?V7O ?Remark. Separate words with medium gap. 15Variable-length codingQ. How do we avoid ambiguity?A. Ensure that no codeword is a prefix of another.Ex 1. Fixed-length code.Ex 2. Append special stop symbol to each codeword.Ex 3. Custom prefix-free code.16Variable-length codingcharcodeS• • •E•I••V•••-prefix of Vprefix of I, Sprefix of Sabracadabra!0111110010100100011111001011charcodea0b111c1010d100r110!110128 bits to encode message (vs. 36 bits for fixed-length 3-bit code)17Q. How to represent a prefix-free code?A. Binary trie.•Symbols are stored in leaves.•Codeword is path to leaf.Encoding.•Method 1: start at leaf; follow path upto the root, and print bits in reverse order.•Method 2: create ST of symbol-codeword pairs.Decoding.•Start at root of tree.•Go left if bit is 0; go right if 1. •If leaf node, print symbol and return to root.Prefix-free code: encoding and decodingad0c !r b10 10 10 10 1charcodea0b111c1010d100r110!101118Representing the codeQ. How to transmit the trie?A. Send preorder traversal of trie.Note. If message is long, overhead of transmitting trie is small.*a**d*c!*rb120111110010100100011111001011preorder traversal# chars to decodethe message bits(pack 8 to the byte)ad0c !r b10
View Full Document