Fall 2011 Computational Genomics and Molecular Biology

Problem Set 4

Collaboration is allowed on this homework. You must hand in homeworks individually and list the names of the people you worked with. Homework must be submitted by 5pm on Friday, December 9th, in MI646 or electronically to [email protected].

1. The Kimura 2-parameter model of sequence evolution distinguishes between transitions (purine-purine and pyrimidine-pyrimidine replacements) and transversions (purine-pyrimidine and pyrimidine-purine replacements). Under the Kimura model, the expected number of sites at which a substitution occurred can be estimated from the number of mismatching sites by

   $$d = -N \left( 0.5 \ln(1 - 2\hat{p}_1 - \hat{p}_2) + 0.25 \ln(1 - 2\hat{p}_2) \right),$$

   where $N$ is the length of the alignment and $\hat{p}_1$ and $\hat{p}_2$ are the number of transitions per site and the number of transversions per site, respectively.

   (a) Suppose you are given two sequences of length 200 that differ by 20 transitions and 4 transversions.
       i. What is the expected number of substitutions that occurred since these sequences diverged from their common ancestor?
       ii. What is the difference between the number of observed mismatches and the substitution distance estimated by the model?
       iii. Use the Jukes Cantor model instead of the Kimura 2-parameter model to estimate the expected number of substitutions. What is the expected number of substitutions according to the JC model?
       iv. What is the difference between the number of observed mismatches and the substitution distance estimated by the model?

   (b) Suppose you are given two other sequences of length 200. These sequences differ by 50 transitions and 16 transversions.
       i. What is the expected number of substitutions that occurred since these sequences diverged from their common ancestor according to the K2P model?
       ii. What is the difference between the number of observed mismatches and the substitution distance estimated by the model?
       iii. What is the expected number of substitutions that occurred since these sequences diverged from their common ancestor according to the JC model?
       iv. What is the difference between the number of observed mismatches and the substitution distance estimated by the model?

   (c) Based on your results, does it matter which model you use? Is your answer the same for both of the above examples?

2. Consider the following matrix of observed distances between four taxa, A, B, C and D:

            B    C    D
       A    9   18   19
       B        19   20
       C              5

   (a) Does this matrix fit a tree? How do you know?
   (b) Are all sequences in this data set changing at the same rate? How do you know?
   (c) Which of the three unrooted topologies with four leaves is preferred by this distance matrix? (Hint: to find just the preferred topology, without inferring the branch lengths, you do not need to apply an algorithm.)

3. Under the maximum parsimony criterion, we say a column, or site, in a multiple sequence alignment is informative if it favors one tree topology over another. If the parsimony score at a given site in the alignment is the same for all topologies, then the site is uninformative.

   (a) For each site in the following alignment of sequences from four taxa,

            1 2 3 4 5 6 7 8 9
       X.   C C G T A G G A C
       Y.   A C C T G T G T C
       Z.   A G A T G T G C C
       W.   A G T T A G G C C

       state
       i. if it is an informative site
       ii. if so, which of the possible tree topologies for four taxa does it favor?
       iii. if not, what is the parsimony score for this site?
   (b) Show the most parsimonious tree(s).
   (c) What is the maximum parsimony score for this data set?

4. What is the parsimony score of the following tree? Mark mutations on branches and show the inferred set of bases at each internal node.

5. (a) How many rooted tree topologies are there for seven species, A, B, C, D, E, F and G?
   (b) Suppose you know that species A, B, C and D are grouped together in the left subtree and E, F and G are grouped together in the right subtree. Under this constraint, how many alternate rooted tree hypotheses are there for A, B, C, D, E, F and
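
The distance formula in Problem 1 can be evaluated with a few lines of code. The following is a minimal sketch, not part of the original handout: the function names are hypothetical, the Jukes-Cantor correction d = -N (3/4) ln(1 - (4/3) p) is the standard textbook form and is an assumption here because the problem text does not state it, and the input counts in the usage lines are made-up values rather than the ones from the problem.

```python
import math


def k2p_substitutions(n_sites, n_transitions, n_transversions):
    """Expected number of substitutions under the Kimura 2-parameter
    formula given in Problem 1 (hypothetical helper name)."""
    p1 = n_transitions / n_sites    # transitions per site
    p2 = n_transversions / n_sites  # transversions per site
    return -n_sites * (0.5 * math.log(1 - 2 * p1 - p2)
                       + 0.25 * math.log(1 - 2 * p2))


def jc_substitutions(n_sites, n_mismatches):
    """Expected number of substitutions under the standard Jukes-Cantor
    correction (assumed form; not stated in the problem text)."""
    p = n_mismatches / n_sites      # mismatches per site
    return -n_sites * 0.75 * math.log(1 - (4.0 / 3.0) * p)


# Illustrative inputs only: 100 sites, 10 transitions, 5 transversions.
observed = 10 + 5
print("observed mismatches:", observed)
print("K2P estimate:", k2p_substitutions(100, 10, 5))
print("JC estimate:", jc_substitutions(100, observed))
```

Both functions raise a `ValueError` from `math.log` when the sequences are so divergent that the argument of a logarithm is zero or negative, which is the regime where these corrections are undefined.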