03-511/711 Computational Genomics and Molecular Biology, Fall 2010 1Problem Set 2 Due October 5thCollaboration is allowed on this homework. You must hand in homeworks individually and listthe names of the people you worked with.You may not use a program to do this homework. Turn in your handwritten answers on theattached sheets. Extra alignment templates will be available on the website.1. Progressive alignment is a heuristic method that builds up a series of partial multiple align-ments. At each step, alignments from the previous step are merged to yield a larger multiplealignment, until a multiple alignment containing all sequences in the data set is obtained.The most common approach to merging alignments is to align them as though they weresequences over a larger alphabet. For example, a pairwise global alignment yields a sequenceover the 24 letter alphabe t Σ∗× Σ∗− {()} (i.e., {AA, AC, AG, AT, CA, . . .}). Durbin callsthis a profile.Suppose you want to align a a profile s[1, .., m] with a sequence t[1, .., n], where elements ins[i] are of the form x y, xor y, and elements in t[i] are of the form z, where x, y, z ∈ Σ.(a) Write down the recurrence relation for calculating the alignment matrix a[i, j], for thecase where s[i] is of the form x y, in terms of p(·, ·), g, x, y and t[·].(b) Write down the recurrence relation for calculating a[i, j], for the c ase where s[i] is of theform xor y.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 22. Consider the following multiple sequence ali gnment:A: I R E L S M F N I D MB: L K D M S V W Q I E LC: L N E L C V W Q V E VD: V K D L S L F H V Q V(a) Determine the parsimony score each of the three unrooted topologies for the taxa A, B,C and D, by summing the parsimony scores for each column. Which one(s) are mostparsimonious?(b) How many sites are informative ?(c) Compute the pairwise edit distance for all six pairs in the data set to obtain a pairwisedistance matrix.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 3(d) Does your matrix fit a tree? How do you know?(e) Are all sequences in this data set changing at the same rate? How do you know?(f) Which of the three unrooted topologies with four leaves is preferred by this distancematrix? (Hint: to find just the preferred topology, without inferring the branch lengths,you do not need to apply an algorithm.)(g) Based on comparing your answers for (a) and (b) with your answer in (f), which approachis better for this data set: distances or parsimony? Why?03-511/711 Computational Genomics and Molecular Biology, Fall 2010 43. Consider a tree enumeration scheme for generating all unrooted trees with k leaves thatiterates from i = 4 to i = k. At the i’th step, it generates trees with i leaves by attachingthe ith taxon to each tree with i − 1 leaves at each possible edge. The figure below shows theenumeration process for i = 3, 4 and 5 for a tree with five taxa: A, B, C, D, E.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 5(a) You wish to find all topologies with minimum parsimony score, given the data A = “T”,B = “G”, C = “T”, D = “G”, E = “G”. As a first step, give the parsimony score forthe trees with three leaves and four leaves. Show the internal labels and the edges onwhich change occurs. (There may be more than one set of optimal labels. It is sufficientto show one.)(b) Suppose you have a topology at step i − 1 with score s and you attach the ith taxon toone of its edges. What is the possible range of scores, in terms of s, that the resultingtree may have?(c) Based on your labels of trees of size i = 4, what is the minimum number of topologiesof size 5 must you score in order to ensure that you have found all most parsimonioustrees of size 5?(d) Show the set of all most parsimonious topologies with 5 leaves.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 64. This problem illustrates the strengths and weaknesses of different approaches to tree recon-struction. Remember that while in this case you know the tree from which the distancematrix was derived, when analyzing real data, you will only have the matrix, not the tree.(a) Compute the distance matrix for the following rooted tree:912622126RabbitArmadilloMouseHumanFrog03-511/711 Computational Genomics and Molecular Biology, Fall 2010 7(b) UPGMA and Neighbor Joining (NJ) are greedy algorithms that build trees by iterativelyidentifying neighboring nodes and merging them to form a new subtree. How does theUPGMA algorithm select the taxa that will be neighbors in the next iteration?(c) Working only from the distance matrix (forget you know the tree), which nodes willUPGMA select as the first pair of neighbors? Why?(d) Draw the first subtree with an intermediate node x linking the two neighbors. What arethe branch lengths in this subtree?03-511/711 Computational Genomics and Molecular Biology, Fall 2010 8(e) How does the NJ algorithm select the taxa that will be neighbors in the next iteration?(f) Working only from the distance matrix, which nodes will Neighbor Joining select as thefirst pair of neighbors? Show your calculations for the first iteration of the algorithm.(g) Draw the first subtree constructed by the Neighbor Joining algorithm. What are thebranch lengths in this subtree?(h) Which algorithm is a better choice for reconstructing this tree? Why?(i) How would you determine the root of the reconstructed tree? Describe the procedure indetail.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 95. There are three rooted trees with three leaves:BACCB AA B CFor each of these trees, we can build a tree with four leaves, by adding one more edge at fivedifferent locations:A B CDDA CBDA CBA B CDA B CDA B C03-511/711 Computational Genomics and Molecular Biology, Fall 2010 10(a) Give Ek, the number of edges in a rooted tree with k leaves, and Tk, the number ofrooted tress with k leaves, for k = 3, k = 4 and k = 5.(b) Write down the expression for Ekin terms of k.(c) Write down the recurrence relation for the number of rooted trees with k leaves in termsof Ek−1and Tk−1(d) Write down the expression for Tkin terms of k; i.e., solve your recurrence rel
View Full Document