DOC PREVIEW
MIT 6 050J - Problem Set #4

This preview shows page 1 out of 3 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Massachusetts Institute of TechnologyDepartment of Electrical Engineering and Computer ScienceDepartment of Mechanical Engineering6.050J/2.110J Information and Entropy Spring 2003Issued: February 28, 2003 Problem Set 4 Due: March 7, 2003Problem 1: Meet the Box PeopleA discrete character difference can often be determined by the difference in a single gene. We will refer to theoutwardly detectable expression of the gene as its phenotype (red or blue for example). In higher organisms,genes are represented twice in each cell, as a gene pair. The genes we will be considering can c ome in one oftwo forms: either dominant or recessive. If a dominant gene is one of the genes of a pair, it will always beexpressed. We will represent the dominant form of a gene with an uppercase letter and the recessive formwith a lowercase letter.In this problem, we consider the expression of two genes in an imaginary organism called a box person.We call them the r and c genes. The r gene determines the color of the box person, where red is the dominantphenotype and blue the recessive phenotype. The c gene determines the shape of the box person, wherethe circular shape is the dominant phenotype and the square shape the recessive phenotype. Here we willexamine the progeny (offspring) of two box people. We will assume that both of the parents have RrCcgene combinations. Each child receives one of the mother’s two color genes with equal probability (in thiscase either R or r) and one of the father’s two color genes also with equal probability (again either R or r).For some reason the shape genes act differently: a child receives the dominant gene C only 30% of the time(rather than 50%) from a mother or a father that has one of each. We assume that for each child these fourselections are independent. Below is a chart showing all sixteen possible results for a single child. This chartshows that the recessive gene is expressed in the shape or color of the child only if both of the genes are ofthe recessive type. (When this chart is printed without color, blue usually appears darker than red.)Figure 4–1: The Box People1Problem Set 4 2a. What is the probability that one of the box people’s offspring has...i. a circular phenotype?ii. a square phenotype?iii. a blue phenotype?iv. a red phenotype?b. If you’re told that one of their offspring has a circular phenotype, what’s the probability that theoffspring also has...i. a blue phenotype?ii. a red phenotype?c. If instead of what you were told in part b., you’re told that one of their offspring has a redphenotype, what’s the probability that the offspring also has...i. a box phenotype?ii. a circular phenotype?d. This question relates to another gene of the box people, unrelated to the r and c genes. Unfor-tunately, a mutant gene can turn box people into triangles late in life. A laboratory test hasbeen developed which can spot the gene early so that the dreaded triangle transformation can beprevented by medications. This test is 95 p erc ent accurate at spotting the gene when it is there.However, the test gives a “false positive” 0.4 percent of the time, falsely indicating that a healthybox person has the mutant gene. If 0.1 percent (be careful – that’s one-tenth of one percent) ofthe box people have the mutant gene, what’s the probability that a box person actually has themutant gene if the test indicates that he or she does?Problem 2: Huffman CodingNote: It is not necessary to use MATLAB for this problem; however, you should feel free if you enjoy thechallenge. If you decide to use MATLAB, please place your code in ps4p2.m.You will make use of Huffman coding for this problem. You have been asked to encode a tounge-twisterphrase compactly. This is the sequence of characters, you may recognize it from Problem Set 2:peter piper picked a peck of pickled peppersFor your convenience here is the frequency distribution is listed in Table 4–1.a. One way of coding this sequence would b e to use a fixed-length code, with each code word longenough to encode fourteen different symbols (this is not a Huffman code). How many bits wouldbe needed for this 44-character phrase using such a fixed-length code?b. Determine the theoretical minimum number of bits required to encode the entire phrase (thisis the information content of the phrase), assuming that each character is independent of thesurrounding character. As reference, the equation from class for the average information of asingle symbol is:Xi=1...npilog21pi(4–1)Problem Set 4 3Character # Frequencyp 9 20.46%e 8 18.18%space 7 15.91%c 3 6.82%i 3 6.82%k 3 6.82%r 3 6.82%d 2 4.55%a 1 2.27%f 1 2.27%l 1 2.27%o 1 2.27%s 1 2.27%t 1 2.27%Total 44 100.00%Table 4–1: Frequency distribution of characters in “peter piper picked a peck of pickled pe ppers”c. What is the theoretical contribution of each of the fourteen symbols to this average?d. Derive a codebook for the fourteen symbols using Huffman coding. This codebook will, of course,depend on the frequencies of the symbols in the original phrase.e. When the sequence is encoded using the codebook derived in part d. . . .i. How many bits are needed?ii. How does this compare with the number of bits needed to use the fixed length code ofpart a?iii. How does this compare with the information content of the phrase as calculated in part b?f. An alternate approach to producing a compact code would be to encode the notes as in part a.above and then use LZW lossless compaction. Compare the number of bits needed using LZWwith the numbers derived above in parts a. and e. The LZW approach has the advantage thatthe dictionary does not have to be transmitted. Estimate how many bits are needed to transmitthe codebook if Huffman coding is used.Turning in Your SolutionsYou should have 1 file, your diary (if you chose to use MATLAB for Problem 2, you may have an M-file aswell). Name the diary ps4diary. If you are submitting an M-file for Problem 2, please name it ps4p2.m.You may turn in this problem set by e-mailing your M-files and diary to [email protected]. Do thiseither by attaching them to the e-mail as text files, or by pasting their content directly into the body of thee-mail (if you do the latter, please indicate clearly where each file begins and ends). Alternatively, you mayturn in your solutions on paper in room 38-344. The deadline for submission is the same no matter whichoption you choose.Your solutions are due 5:00 PM on Friday, March 7, 2003.


View Full Document

MIT 6 050J - Problem Set #4

Download Problem Set #4
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Problem Set #4 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Problem Set #4 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?