03-511/711 Computational Genomics and Molecular Biology, Fall 2009 1Problem Set 2 Due October 6thCollaboration is allowed on this homework. You must hand in homeworks individually and listthe names of the people you worked with.1. Given the distance function, M = 0, m = 1, g = 1, what is the relationship (<, ≤, =,etc.) between the quantities X and Y in each of the following cases? Gi ve a one sentenceexplanation of your answer.(a) X = the sum-of-pairs (SP) score of a multiple sequence alignment.Y = the tree alignment score of the same multiple sequence alignment.(b) X = the SP sc ore of a multiple sequence alignment obtained by dynamic programming.Y = the SP score of a multiple alignment of the same sequences obtained by progressivealignment.(c) X = the SP score of the optimal multiple alignment of sequences S1. . . Sk.Y =Pki=1Pkj>iD(Si, Sj), where D(Si, Sj) is the optimal pairwise score between seq-uences Siand Sj.03-511/711 Computational Genomics and Molecular Biology, Fall 2009 22. Under the maximum parsimony criterion, we say a column, or site, in a multiple sequencealignment is informative, if it favors one tree topology over another. If the parsimony scoreat a given site in the alignment is the same for all topologies, then the site in uniformative.(a) For each site in the following alignment of sequences from four taxa,1 2 3 4 5 6 7 8 9X. C C G T A G G A CY. A C C T G T G T CZ. A G A T G T G C CW. A G T T A G G C Cstatei. if it is an informative siteii. if so, which of the possible tree top ol ogies for four taxa does it favor?iii. if not, what is the parsimony score for this site?(b) Show the most parsimonious tree(s).(c) What is the maximum parsimony score for this data set?03-511/711 Computational Genomics and Molecular Biology, Fall 2009 33. What is the parsimony score of the following tree? Show your work.A CCGAT GTCG03-511/711 Computational Genomics and Molecular Biology, Fall 2009 44. Consider a pair of DNA sequence that are changing according to the Jukes-Cantor model.The residues in these sequences are changing at two different rates: Half the sites change atα1= 8 · 10−4changes per site per million years, while the other half of the sites are changingfive times as fast (α2= 5α1). Unfortunately, we don’t know which of the sites are changingat the fast rate.(a) Give an expression for Pmismatch, the expected fraction of sites with observable differ-ences between two sequences, as a function of α1, α2and t.(b) Suppose we incorrectly assume that all sites are following a Jukes-Cantor model ofchange with the same parameter α = (α1+ α2)/2. Give an expression forˆPmismatch,the expected fraction of observable differences as a function of α and t in this case.(c) Use your favorite plotting program, plotˆPmismatchand Pmismatchas a function of t,where t, given in units of millions of years, varies from 0 to 3000.(d) What are the consequences of ignoring rate variation and using the average rate instead?How does this effect vary with t? Can you explain why the impact is higher at somevalues of t than at others?03-511/711 Computational Genomics and Molecular Biology, Fall 2009 55. This problem illustrates the strengths and weaknesses of different distance-based methodsfor tree reconstruction. Remember that while in this case you know the tree from which thedistance matrix was derived, when analyzing real data, you will only have the matrix, notthe tree.(a) Compute the distance matrix for the following rooted tree: A, B, C, D and E:CBADE101252210503-511/711 Computational Genomics and Molecular Biology, Fall 2009 6(b) UPGMA and Neighbor Joining (NJ) are greedy algorithms that build trees by iterativelyidentifying neighboring nodes and merging them to form a new subtree. How does theUPGMA algorithm select the taxa that will be neighbors in the next iteration?(c) Working only from the distance matrix (forget you know the tree), which nodes willUPGMA select as the first pair of neighbors? Why?(d) Draw the first subtree with an intermediate node x linking the two neighbors. What arethe branch lengths in this subtree?03-511/711 Computational Genomics and Molecular Biology, Fall 2009 7(e) How does the NJ algorithm select the taxa that will be neighbors in the next iteration?(f) Working only from the distance matrix, which nodes will Neighbor Joining select as thefirst pair of neighbors? Show your calculations for the first iteration of the algorithm.(g) Draw the first subtree constructed by the Neighbor Joining algorithm. What are thebranch lengths in this subtree?(h) Which algorithm is a better choice for reconstructing this tree? Why?03-511/711 Computational Genomics and Molecular Biology, Fall 2009 86. (a) Consider the following matrix of observed distances between four taxa, A, B, C and D:B C DA 21 6 42B 26 61C 45You wish to reconstruct a tree for the above matrix. Which algorithm would you use toreconstruct the topology and why? How would you determine the location of the root?(b) Consider the following matrix of observed distances between four taxa, A, B, C and D:B C DA 5 8 35B 12 26C 22You wish to reconstruct a tree for the above matrix. Which algorithm would you use toreconstruct the topology and why? How would you determine the location of the root?03-511/711 Computational Genomics and Molecular Biology, Fall 2009 9(c) Consider the following matrix of observed distances between four taxa, A, B, C and D:B C DA 9 18 19B 19 20C 5You wish to reconstruct a tree for the above matrix. Which algorithm would you use toreconstruct the topology and why? How would you determine the location of the
View Full Document