UW CSEP 590 - Computational Biology - D1981950


School name University of Washington

Course Csep 590- Special Topics In Computer Science (PMP)


CSEP 590A: Computational Biology
Assignment #3
due: Thursday, July 27

Due date is set for 7/27, but lecture five will make more sense if you have time to start before 7/20, and I will likely have some more to hand out this week. Turn this one in on paper; handwritten is fine, I don't recommend trying to typeset it. Extra credit is for extra practice and glory; it is not a big component of your grade.

1. Bayes Rule: In a certain population, an obese person has a 30 percent chance of having high blood pressure and a non-obese person has a 10 percent chance of having high blood pressure. Twenty percent of the population is obese. What is the conditional probability that a person is obese, given that the person has high blood pressure?

2. Maximum Likelihood: Let $x_1, x_2, \ldots, x_n$ be $n$ samples of a normal random variable $X$ with mean $\theta_1$ and variance $\theta_2$. In class I showed that the maximum likelihood estimates of $\theta_1$ and $\theta_2$ when both are unknown give a biased estimate of $\theta_2$. What is the MLE of $\theta_2 = \sigma^2$ if $\theta_1 = \mu$ is assumed to be known? Extra Credit: Is it biased, i.e., does the expected value of $\hat{\theta}_2$ differ from $\theta_2$?

3. EM: In class, I sketched the EM algorithm for the two-component Gaussian Mixture Model only in the special case when both subpopulations were assumed to share the same variance and the mixing proportions were assumed to be 50/50. Carry out the analysis for the general case where $\sigma_1^2$, $\sigma_2^2$ and $0 \le \tau_1 \le 1$ ($\tau_2 = 1 - \tau_1$) are arbitrary.

4. Maximum Likelihood: Suppose $X$ is a discrete random variable with three possible outcomes, say $A_1$, $A_2$ and $A_3$. Let $\theta = (p_1, p_2, p_3)$ be the probabilities of outcomes $A_1$, $A_2$, $A_3$, resp., (where $p_1 + p_2 + p_3 = 1$, of course). Suppose you have collected $n$ independent random samples $x_1, x_2, \ldots, x_n$ drawn from this distribution. Using the same basic approach as in the coin-flipping example in the class notes (Lec 4, slide 9), show that the maximum likelihood estimators for the parameters $\theta$ are $\hat{\theta} = (n_1/n, n_2/n, n_3/n)$, where $n_i$ is the number of occurrences of outcome $A_i$ among $x_1, x_2, \ldots, x_n$. Hint: The algebra is mildly easier if you happen to remember Lagrange multipliers, but it's certainly not essential. (FYI, this result generalizes to arbitrary multinomial distributions, not just 2 or 3 outcomes; see the slick proof in Chapter 11.)

5. EM: Recall that an allele of a gene is one variant of its DNA or protein sequence. Individuals generally carry two (possibly identical) alleles of each gene, one inherited from mother, one from father (genes on the X/Y chromosomes being exceptions). The ABO blood type gene has three common alleles in the human population: A, B and O. The blood type of an individual depends as follows on the pair of alleles that he or she has: type A if the pair is A/A or A/O; type B if the pair is B/B or B/O; type AB if the pair is A/B; type O if the pair is O/O. Let p(A) be the fraction of A alleles in the population, p(B), the fraction of B alleles and p(O), the fraction of O alleles. These fractions are nonnegative and sum to 1. Under the standard assumption in genetics of independent assortment, the probability that an individual has a given pair of alleles is the same as the probability of obtaining that pair in two random draws from the set of all alleles in the population: for example, the probability of the pair A/B is 2p(A)p(B). In a sample of 20 individuals, 9 have blood type A, 2 have blood type B, 1 has blood type AB and 8 have blood type O. Derive the appropriate formulas needed to use the EM algorithm to determine the values of p(A), p(B) and p(O) most likely to have given rise to this data. Then run the algorithm for a few iterations on the given data. Try it with a couple of very different starting estimates for the parameters. You may write a program to do the iteration, do it by hand, or give a spreadsheet with the relevant formulas and "fill down" a few rows to iterate. If you use a spreadsheet, turn in a printout of the formulas as well as the numbers; I think CONTROL-backquote causes Excel to show all formulas. Hint: The parameters are p(A), p(B) and p(O), the observed data are the blood types of the individuals and the hidden data are the pairs of alleles possessed by the individuals. The solution to problem 4 will help. Depending on how you set up the likelihood function, you might (or might not) need the multinomial distribution from pg 300 of the text.

(If you'd like info on the genetics of the ABO blood group system, the 1930 Nobel prize in Physiology or Medicine, have a look at Wikipedia http://en.wikipedia.org/wiki/Abo_blood_group or OMIM http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=110300. In a nutshell, they are 3 alleles of a single gene on the ninth chromosome (9q34), which encodes a glycosyltransferase, an enzyme that modifies the carbohydrate content of the red blood cell antigens. The A and B alleles perform slightly (but immunologically significantly) different modifications; the O allele has a 1 base deletion, hence an altered reading frame, producing a very different protein with no apparent function at all, a so-called "null" allele. Aside from issues with blood transfusions, people with O blood type are apparently more susceptible to cholera. And, no, the "independent assortment" assumption for this gene is not well justified in the human population; prevalence is strongly dependent on geography. But we'll ignore that for this problem...)

Extra Credit Problems:

6. Maximum Likelihood: Suppose $X$ is a random variable uniformly distributed between 0 and $\theta > 0$ for some unknown $\theta$. Based on a sample $x_1, x_2, \ldots, x_n$ of $X$, what is the maximum likelihood estimator of $\theta$? Is it biased?

7. EM: Generalize the EM algorithm from problem 3 to allow a fixed but arbitrary number $k \ge 1$ of components in the mixture, preferably allowing a choice of either a common variance $\sigma^2$ shared by all clusters, or a separate variance per cluster. Implement it and experiment with simulated data to see how well it recovers the parameters you used to generate the data. How quickly does the iteration converge? Does it ever seem to be converging to a local, not global, max? How well does it work with sparse data? Well-separated clusters? Highly overlapping
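
For the "write a program" option in problem 5, the iteration might be organized roughly as below. This is only a sketch under stated assumptions, not a reference solution: it assumes the common "gene counting" form of EM, in which the E-step splits the type A and type B individuals into expected genotype counts and the M-step re-estimates each allele fraction as its expected count divided by the 2n alleles in the sample. The function names `em_step` and `run_em` and the two starting points are illustrative choices that do not come from the assignment, and the derivation of the formulas is still left to the reader.

```python
def em_step(p_a, p_b, p_o, n_a, n_b, n_ab, n_o):
    """One EM iteration for ABO allele fractions (hypothetical helper).

    p_a, p_b, p_o are the current estimates of p(A), p(B), p(O);
    n_a, n_b, n_ab, n_o are the observed blood-type counts.
    Assumes the current estimates are strictly positive.
    """
    # E-step: expected genotype counts among type A and type B individuals.
    # P(A/A | type A) = p_a^2 / (p_a^2 + 2 p_a p_o) = p_a / (p_a + 2 p_o)
    exp_aa = n_a * p_a / (p_a + 2.0 * p_o)
    exp_ao = n_a - exp_aa
    exp_bb = n_b * p_b / (p_b + 2.0 * p_o)
    exp_bo = n_b - exp_bb

    # M-step: each person carries two alleles, so there are 2n in total.
    total_alleles = 2.0 * (n_a + n_b + n_ab + n_o)
    new_a = (2.0 * exp_aa + exp_ao + n_ab) / total_alleles
    new_b = (2.0 * exp_bb + exp_bo + n_ab) / total_alleles
    new_o = (exp_ao + exp_bo + 2.0 * n_o) / total_alleles
    return new_a, new_b, new_o


def run_em(start, counts, iterations=20):
    """Iterate em_step from `start`; return every estimate, start included."""
    history = [tuple(start)]
    for _ in range(iterations):
        history.append(em_step(*history[-1], *counts))
    return history


if __name__ == "__main__":
    counts = (9, 2, 1, 8)  # types A, B, AB, O from the problem statement
    # Two deliberately different starting estimates (arbitrary choices).
    for start in [(0.3, 0.3, 0.4), (0.8, 0.1, 0.1)]:
        for i, (a, b, o) in enumerate(run_em(start, counts, iterations=10)):
            print(f"iter {i:2d}: p(A)={a:.4f}  p(B)={b:.4f}  p(O)={o:.4f}")
        print()
```

Printing every iterate, rather than only the last one, makes it easy to compare how quickly the two starting points approach each other, which is what the problem asks you to look at.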
