DOC PREVIEW
Princeton COS 333 - Software Explorations

This preview shows page 1-2-3-4 out of 11 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Software ExplorationsTHE BIG SQUEEZEBY JON BENTLEYEnglish text has a lot of redundancy. People are prettygood at squeezing out redundancy and then expanding itback again. That’s why we can read advertisements likethis:WASH DC Lg furn hse avail Jan 93. Move-in cond! 3stories, form din rm (almost nu china incld), ball rm,fam rm, fin bsmt w/sep entr, wd flrs, fpls, lg encl yrdw/vu, xcel loc nr pk, bulletprf wndws, w/many charmarchitect details.Data compression is important for computers, too.Squeezing data can reduce the size of disk files, cut tele-phone transmission costs, and increase the effectivebandwidth of video channels. In this column we’ll studycompression techniques by tackling a venerable prob-lem.The ProblemLet’s start with the rules of the game. Our problem isto see how much we can squeeze a dictionary. We don’tneed to implement compression and expansion pro-grams. Our task is only to measure the effect of thecompression — prototyping should always precedebuilding. We can use any widely available UNIX tool.Our dictionary is really a sorted word list: it contains71661 words, one word per line, for a total of 691752characters. There is no additional information, such aspronunciation, definition, or etymology. All charactersare lower-case letters. We’ll soon see the first few linesof the dictionary. I made the dictionary by translating allcapital letters in my system’s dictionary to lower case,deleting words that contain nonalphabetic characters,and then removing duplicate words:tr ’A-Z’ ’a-z’ </usr/dict/words |grep -v ’[ˆa-z]’ | uniq >dict.inExercise 1. Before you read further, jot down somemethods that you might use to compress the dictio-nary, and estimate the effectiveness of each. All-in-all,by what factor do you think you could squeeze the dic-tionary?The clean and precise rules identify this task as aclassroom exercise, but similar problems arise in a hostof applications. A list of English words is a critical com-ponent in programs like text editors and spelling check-ers. A compressed dictionary takes less space to store,Software Explorations — Galley 2less time to transmit over communication lines, and isoften faster to read from disk.Quick and a Fair Code, Five BitsEach character in the dictionary is represented on mymachine by an 8-bit byte. But 8 bits can represent28= 256 different characters, and we are wasting themon just 27 different characters (26 letters and newline).Because 25= 32, we can represent our dictionary using a5-bit code. This will reduce the space required by thedictionary to just 62.5% of the original representation.According to the rules of our game, we’ve solved theproblem. We have described a scheme and analyzed itsperformance. We could leave the encoding and decod-ing programs as ‘‘implementation details’’ that are out-side the scope of this exercise. (The best programmers Iknow have big lazy streaks right down the middle of theirbacks.)Exercise 2. Implement programs to encode anddecode 5-bit characters.But for consistency with later programs (and for use inlater pipelines), I wanted to have a program that wouldtell how many bits a file uses under a five-bit encoding ina command like this:bitcost 5 dict.inHere is a shell implementation of bitcost:expr $1 "*" ‘wc -c $2‘The -c tells word count to count the number of charac-ters; the backquotes cause the output to be placed in thecommand line. The expr program evaluates its argu-ments as an arithmetic expression.Exercise 3. Write a bitcost program that reads itsstandard input if it is not given a second argument.Try implementing bitcost using other tools, such asAwk or ls. Extend bitcost to check whether thereare at most 2bitsdistinct characters in the file.Common Prefix EliminationA common technique in data compression is ‘‘differen-tial coding’’. If we are transmitting a series of similarobjects, we send only the differences between objects.In television images, for instance, where large chunks ofbackground are unchanged we may transmit only thechanges from the last screen image.Because the words in the dictionary are sorted, a typi-cal word has many letters in common with its predeces-sor. We’ll therefore represent each word by a count ofthe letters it has in common with its predecessor followedby the letters that differ. The original dictionary is shownon the left, and the encoded version is on the right:Software Explorations — Galley 3a 0aaardvark 1ardvarkaardwolf 4wolfaaron 3onaaronic 5icab 1baba 2aabaca 3caabaci 4iaback 4k...The third line asserts that the word ‘‘aardwolf’’ has 4 let-ters in common with ‘‘aardvark’’, and then the letters‘‘wolf’’.Here is an Awk implementation of the preflen pro-gram to perform prefix-length encoding:{ for (i = 1; substr($1, i, 1) ==substr(last, i, 1); i++) ;print i-1 substr($1, i)last = $1}Because there is no pattern, the action in braces is exe-cuted for each input line. The for loop in the first twolines sets i to the number of characters that the inputword has in common with the last word. The next linewrites the encoded line, and the final line sets last tothe current input word.Awk is a fine implementation language for this task:the code is compact and easy to understand. It is also alittle slow: about four minutes on a VAX 8550 (a machinewith a few MIPS). Throughout the rest of this exercisewe’ll ignore execution speed, but first a tempting diver-sion.Exercise 4. Implement the preflen function in a dif-ferent language. How does its length and speed com-pare to the Awk version?Here is an Awk program to decode the encoded file:{ i = 0 + $1 # int at start of lines = substr(s, 1, i)print s substr($1, length(" " i))}A C version isn’t a lot more complicated; you might enjoywriting it.Table 1 describes the effectiveness of various com-pression schemes. We see that the prefix length encod-ing gives a reduction to about 52% of the original size.Exercise 5. Study the performance of this compres-sion method on your /usr/dict/words. Browse theoutput file, and gather statistics on it.One of the great things about data compressionschemes is that we can ‘‘mix-and-match’’ them. TheSoftware Explorations — Galley 4alphabet now has 37 characters (26 letters, 10 digits andnewline), so we’ll have to go from a 5-bit to a 6-bit code.Table 1 shows that if we use the pipeline preflen |bitcost 6 then we achieve a compression to about39%.Both


View Full Document

Princeton COS 333 - Software Explorations

Download Software Explorations
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Software Explorations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Software Explorations 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?