CMU BSC 03711 - Distance matrix methods - D502312

Home> Schools> Carnegie Mellon University> Biological Sciences (BSC) > BSC 03711> Distance matrix methods

DOC PREVIEW

CMU BSC 03711 - Distance matrix methods

School name Carnegie Mellon University

Course Bsc 03711- Computational Molecular Biology and Genomics

Pages 25

This preview shows page 1-2-24-25 out of 25 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Chapter 11 Distance matrix methods A major family of phylogenetic methods has been the distance matrix methods, intro-duced by Cavalli-Sforza and Edwards (1967) and by Fitch and Margoliash (1967; see also Horne, 1967). They were influenced by the clustering algorithms of Sokal and Sneath (1963). The general idea seems as if it would not work very well: cal-culate a measure of the distance between each pair of species, and then find a tree that predicts the observed set of distances as closely as possible. This leaves out all information from higher-order combinations of character states, reducing the data matrix to a simple table of pairwise distances. One would think that this must leave out so many of the subtleties of the data that it could not possibly do a reasonable job of making an estimate of the phylogeny. Computer simulation studies show that the amount of information about the phylogeny that is lost in doing this is remarkably small. The estimates of the phy-logeny are quite accurate. Apparently, it is not common for evolutionary processes (at least not the simple models that we use for them) to leave a trace in high-order combinations of character states without also leaving almost the same information in the pairwise distances between the species. The best way of thinking about distance matrix methods is to consider dis-tances as estimates of the branch length separating that pair of species. Each dis-tance infers the best unrooted tree for that pair of species. In effect, we then have a large number of (estimated) two-species trees, and we are trying to find the n-species tree that is implied by these. The difficulty in doing this is that the indi-vidual distances are not exactly the path lengths in the full n-species tree between those two species. They depart from it, and we need to find the full tree that does the best job of approximating these individual two-species trees. Branch lengths and times In distance matrix methods, branch lengths are not simply a function of time. They reflect expected amounts of evolution in different branches of the tree. Two 147148 Chapter 11 branches may reflect the same elapsed time (as when they are sister lineages in a rooted phylogeny), but they can have different expected amounts of evolution. In effect, each branch has a length that is a multiple Ti of the elapsed time ti. The product Tit; is the branch length. This allows different branches to have different rates of evolution. The least squares methods We start by describing the least squares methods,which are some of the best-justified ones statistically. The distances themselves also need some discussion, as they must have particular mathematical and statistical properties to work with these methods. We also describe one variant, the minimum evolution methods, and two quicker but more approximate distance matrix methods: UPGMA clustering and the neighbor-joining method. The fundamental idea of distance matrix methods is that we have an observed table (matrix) of distances (Di)), and that any particular tree that has branch lengths leads to a predicted set of distances (which we will denote the di j ). It does so by making the prediction of the distance between two species by adding up the branch lengths between the two species. Figure 11.1 shows a tree and the distance matrix that it predicts. We also have a measure of the discrepancy be-tween the observed and the expected distances. The measure that is used in the least squares methods is n nQ L L Wij (Di) - d·ij )2 (11.1) i=! J=! where the Wij are weights that differ between different least squares methods. Cavalli-Sforza and Edwards (1967) defined the unweighted least squares method in which Wij = 1. Fitch and Margoliash (1967) used Wij = 1/»; and Beyer et al. (1974) suggested Wij = 1/Di j . We are searching for the tree topology and the branch lengths that minimize Q. For any given tree topology it is possible to solve for the branch lengths that minimize Qby standard least squares methods. The summation in equation 11.1 is over all combinations of i and i. Note that when i = j, both the observed and the predicted distances are zero, so that no contribution is made to Q. One can alternatively sum over only those j for which j of. i. Least squares branch lengths To find the branch lengths on a tree of given topology using least squares we must minimize Q. The expression for Q in equation 11.1 is a quadratic in the branch lengths. One way that it can be minimized is to solve a set of linear equations. These are obtained by taking derivatives of Qwith respect to the branch lengths, and equating those to zero. The solution of the resulting equations will minimizeDistance matrix methods 149 B Figure 11.1: A tree and the distances it predicts, which are generated by adding up the lengths of branches between each pair of species. Q. In equation 11.1 the dij are sums of branch lengths. Figure 11.2 shows the same tree with variables for the branch lengths. If the species are numbered in alphabetic order, dI 4 will be the expected distance between species A and D, so that it is VI + V7 +V4. The expected distance between species Band E is V2,5 = V:" +V6 +V7 + V2· Suppose that we number all the branches of the tree and introduce an indicator variable Xij,", which is 1 if branch k lies in the path from species i to species j and ootherwise. The expected distance between i and j will then be a., = L Xij,k Vk' (11.2) k' Equation 11.1 then becomes Q = tL Wij (D'j -LXi,i,kVk) 2 (11.3) 7=1 ] :.1'1', k150 Chapter 11 B V2 D Figure 11.2: The same tree as the previous figure, with the branch lengths as variables. Ifwe differentiate Q with respect to one of the v's such as Vb and equate the deriva-tive to zero, we get the equation dQ 11 ( ) o (11.4)dVk = -28j;; Wij Di) -~Xij.kVk The -2 may be discarded. One way to make a least squares estimate of branch lengths is to solve this set of linear equations. There are both exact and iterative methods for doing this. In the case of Cavalli-Sforza and Edwards's original unweighted least squares methods, where the weights Wi) are all I, the equations are particularly simple. This will lead us to a nice matrix form, and the more general case can then be put in that form. (The reader who is prone to panic attacks at the sight of matrices should skip the rest of this subsection and the one on generalized least squares as well.) For the

View Full Document