Phylogeny Tree ReconstructionInferring PhylogeniesModeling EvolutionSlide 4Slide 5Slide 6Phylogeny and sequence comparisonDistance between two sequencesA simple clustering method for building treeAlgorithm: UPGMAUltrametric Distances & UPGMAWeakness of UPGMAAdditive DistancesNeighbor-JoiningAlgorithm: Neighbor-joiningParsimonyParsimony ScoringExampleTraceback to find ancestral nucleotidesSlide 20Search through tree topologies: Branch and BoundBootstrapping to get the best treesProbabilistic MethodsFelsenstein’s Likelihood AlgorithmSlide 25Phylogeny Tree Reconstruction1432514235Inferring PhylogeniesTrees can be inferred by several criteria:Morphology of the organismsSequence comparisonExample:Orc: ACAGTGACGCCCCAAACGTElf: ACAGTGACGCTACAAACGTDwarf: CCTGTGACGTAACAAACGAHobbit: CCTGTGACGTAGCAAACGAHuman: CCTGTGACGTAGCAAACGAModeling EvolutionDuring infinitesimal time t, there is not enough time for two substitutions to happen on the same nucleotideSo we can estimate P(x | y, t), for x, y {A, C, G, T}Then letP(A|A, t) …… P(A|T, t)S(t) = … …P(T|A, t) …… P(T|T, t)Modeling EvolutionReasonable assumption: multiplicative (implying a stationary Markov process)S(t+t’) = S(t)S(t’)That is, P(x | y, t+t’) = z P(x | z, t) P(z | y, t’)Jukes-Cantor: constant rate of evolution 1 - 3 For short time , S() = 1 - 3 1 - 3 1 - 3Modeling EvolutionJukes-Cantor:For longer times,r(t) s(t) s(t) s(t)S(t) = s(t) r(t) s(t) s(t)s(t) s(t) r(t) s(t)s(t) s(t) s(t) r(t)Where we can derive:r(t) = ¼ (1 + 3 e-4t)s(t) = ¼ (1 – e-4t)Modeling EvolutionKimura:Transitions: A/G, C/TTransversions: A/T, A/C, G/T, C/GTransitions (rate ) are much more likely than transversions (rate )r(t) s(t) u(t) s(t)S(t) = s(t) r(t) s(t) u(t)u(t) s(t) r(t) s(t)s(t) u(t) s(t) r(t)Where s(t) = ¼ (1 – e-4t)u(t) = ¼ (1 + e-4t – e-2(+)t)r(t) = 1 – 2s(t) – u(t)Phylogeny and sequence comparisonBasic principles:•Degree of sequence difference is proportional to length of independent sequence evolution•Only use positions where alignment is pretty certain – avoid areas with (too many) gapsDistance between two sequencesGiven (portion of) sequences xi, xj,Define dij = distance between the two sequencesOne possible definition:dij = fraction f of sites u where xi[u] xj[u]Better model (Jukes-Cantor):dij = - ¾ log(1 – 4f / 3)A simple clustering method for building treeUPGMA (unweighted pair group method using arithmetic averages)Given two disjoint clusters Ci, Cj of sequences, 1 dij = ––––––––– {p Ci, q Cj}dpq |Ci| |Cj|Note that if Ck = Ci Cj, then distance to another cluster Cl is: dil |Ci| + djl |Cj| dkl = –––––––––––––– |Ci| + |Cj|Algorithm: UPGMAInitialization:Assign each xi into its own cluster CiDefine one leaf per sequence, height 0Iteration:Find two clusters Ci, Cj s.t. dij is minLet Ck = Ci CjDefine node connecting Ci, Cj, & place it at height dij/2Delete Ci, CjTermination:When two clusters i, j remain, place root at height dij/21432514235Ultrametric Distances & UPGMAUPGMA is guaranteed to build the correct tree if distance is ultrametricProof:1. The tree topology is unique, given that the tree is binary2. UPGMA constructs a tree obeying the pairwise distances14235Weakness of UPGMAMolecular clock: implied time is constant for all speciesHowever, certain species (e.g., mouse, rat) evolve much fasterExample where UPGMA messes up:23411432Correct treeUPGMAAdditive DistancesGiven a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting themGiven a tree T & additive distances dij, can uniquely reconstruct edge lengths:•Find two neighboring leaves i, j, with common parent k•Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m12345678910121113d1,4Neighbor-Joining•Guaranteed to produce the correct tree if distance is additive•May produce a good tree even when distance is not additiveStep 1: Finding neighboring leavesDefineDij = dij – (ri + rj)Where 1 ri = –––––k dik |L| - 2Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighborsProof: Too little time today, please read Durbin et al.!12430.10.1 0.10.40.4Algorithm: Neighbor-joiningInitialization:Define T to be the set of leaf nodes, one per sequenceLet L = TIteration:Pick i, j s.t. Dij is minimalDefine a new node k, and set dkm = ½ (dim + djm – dij) for all m LAdd k to T, with edges of lengths dik = ½ (dij + ri – rj)Remove i, j from L; Add k to LTermination:When L consists of two nodes, i, j, and the edge between them of length dijParsimony•One of the most popular methodsIdea: Find the tree that explains the observed sequences with a minimal number of substitutionsTwo computational subproblems:1. Find the parsimony cost of a given tree (easy)2. Search through all tree topologies (hard)Parsimony Scoring Given a tree, and an alignment columnLabel internal nodes to minimize the number of required substitutionsInitialization:Set cost C = 0; k = 2N – 1Iteration:If k is a leaf, set Rk = { xk[u] }If k is not a leaf,Let i, j be the daughter nodes;Set Rk = Ri Rj if intersection is nonemptySet Rk = Ri Rj, and C += 1, if intersection is emptyTermination:Minimal cost of tree for column u, = CExampleABAB{A, B}C+=1{A, B}C+=1{A}{A}{B}{A}{B}Traceback:1. Choose an arbitrary nucleotide from R2N – 1 for the root 2. Having chosen nucleotide r for parent k, If r Ri choose r for daughter iElse, choose arbitrary nucleotide from RiEasy to see that this traceback produces some assignment of cost CTraceback to find ancestral nucleotidesExampleABAB{A, B}{A, B}{A}{A}{B}{A}{B}ABABAA AxxABABAB AxxABABBB BxxAdmissible with TracebackStill optimal, but inadmissible with TracebackSearch through tree topologies: Branch and BoundObservation: adding an edge to an existing tree can only increase the parsimony costEnumerate all unrooted trees with at most n leaves:[i3][i5][i7]……[i2N–5]]where each ik can take values from 0 (no edge) to kAt each point keep C = smallest cost so far for a complete treeStart B&B with tree [1][0][0]……[0]Whenever cost of current tree T is > C, then:T is not optimalAny tree extending T with more edges is not
View Full Document