Phylogeny Tree ReconstructionInferring PhylogeniesModeling EvolutionSlide 4Slide 5Slide 6Phylogeny and sequence comparisonDistance between two sequencesA simple clustering method for building treeAlgorithm: Average LinkageExampleUltrametric Distances and Molecular ClockUltrametric Distances & Average LinkageWeakness of Average LinkageAdditive DistancesSlide 16Reconstructing Additive Distances Given TSlide 18Slide 19Slide 20Neighbor-JoiningAlgorithm: Neighbor-joiningParsimonyParsimony ScoringSlide 25Slide 26Traceback to find ancestral nucleotidesSlide 28Number of labeled unrooted tree topologiesSlide 30Slide 31Slide 32Slide 33Slide 34Search through tree topologies: Branch and BoundBootstrapping to get the best treesProbabilistic MethodsSlide 38Felsenstein’s Likelihood AlgorithmSlide 40Phylogeny Tree Reconstruction1432514235Inferring PhylogeniesTrees can be inferred by several criteria:Morphology of the organismsSequence comparisonExample:Orc: ACAGTGACGCCCCAAACGTElf: ACAGTGACGCTACAAACGTDwarf: CCTGTGACGTAACAAACGAHobbit: CCTGTGACGTAGCAAACGAHuman: CCTGTGACGTAGCAAACGAModeling EvolutionDuring infinitesimal time t, there is not enough time for two substitutions to happen on the same nucleotideSo we can estimate P(x | y, t), for x, y {A, C, G, T}Then letP(A|A, t) …… P(A|T, t)S(t) = … …… …P(T|A, t) …… P(T|T, t)xxytModeling EvolutionReasonable assumption: multiplicative (implying a stationary Markov process)S(t+t’) = S(t)S(t’)That is, P(x | y, t+t’) = z P(x | z, t) P(z | y, t’)Jukes-Cantor: constant rate of evolution 1 - 3 For short time , S() = I+R = 1 - 3 1 - 3 1 - 3A CGTModeling EvolutionJukes-Cantor:For longer times,r(t) s(t) s(t) s(t)S(t) = s(t) r(t) s(t) s(t)s(t) s(t) r(t) s(t)s(t) s(t) s(t) r(t)Where we can derive:r(t) = ¼ (1 + 3 e-4t)s(t) = ¼ (1 – e-4t)S(t+) = S(t)S() = S(t)(I + R)Therefore,(S(t+) – S(t))/ = S(t) RAt the limit of 0,S’(t) = S(t) REquivalently,r’ = -3r + 3ss’ = -s + rThose diff. equations lead to:r(t) = ¼ (1 + 3 e-4t)s(t) = ¼ (1 – e-4t)Modeling EvolutionKimura:Transitions: A/G, C/TTransversions: A/T, A/C, G/T, C/GTransitions (rate ) are much more likely than transversions (rate )r(t) s(t) u(t) s(t)S(t) = s(t) r(t) s(t) u(t)u(t) s(t) r(t) s(t)s(t) u(t) s(t) r(t)Where s(t) = ¼ (1 – e-4t)u(t) = ¼ (1 + e-4t – e-2(+)t)r(t) = 1 – 2s(t) – u(t)Phylogeny and sequence comparisonBasic principles:•Degree of sequence difference is proportional to length of independent sequence evolution•Only use positions where alignment is pretty certain – avoid areas with (too many) gapsDistance between two sequencesGiven sequences xi, xj,Define dij = distance between the two sequencesOne possible definition:dij = fraction f of sites u where xi[u] xj[u]Better model (Jukes-Cantor):f = ¾ (1 – e-4t) ¾ e-4t = ¾ - f log (e-4t) = log (1 – 4/3 f)dij = t = - ¼ -1 log(1 – 4/3 f)A simple clustering method for building treeUPGMA (unweighted pair group method using arithmetic averages)Or the Average Linkage MethodGiven two disjoint clusters Ci, Cj of sequences, 1dij = ––––––––– {p Ci, q Cj}dpq|Ci| |Cj|Claim that if Ck = Ci Cj, then distance to another cluster Cl is:dil |Ci| + djl |Cj| dkl = –––––––––––––– |Ci| + |Cj|Proof Ci,Cl dpq + Cj,Cl dpq dkl = –––––––––––––––– (|Ci| + |Cj|) |Cl| |Ci|/(|Ci||Cl|) Ci,Cl dpq + |Cj|/(|Cj||Cl|) Cj,Cl dpq = –––––––––––––––––––––––––––––––––––– (|Ci| + |Cj|) |Ci| dil + |Cj| djl= ––––––––––––– (|Ci| + |Cj|)Algorithm: Average LinkageInitialization:Assign each xi into its own cluster CiDefine one leaf per sequence, height 0Iteration:Find two clusters Ci, Cj s.t. dij is minLet Ck = Ci CjDefine node connecting Ci, Cj, & place it at height dij/2Delete Ci, CjTermination:When two clusters i, j remain, place root at height dij/21432514235Examplev w x y zv0 6 8 8 8w0 8 8 8x0 4 4y0 2z0y zxwv1234v w x yzv0 6 8 8w0 8 8x0 4yz0v w xyzv0 6 8w0 8xyz0vw xyzvw0 8xyz0Ultrametric Distances and Molecular ClockDefinition:A distance function d(.,.) is ultrametric if for any three distances dij dik dij, it is true that dij dik = dijThe Molecular Clock:The evolutionary distance between species x and y is 2 the Earth time to reach the nearest common ancestorThat is, the molecular clock has constant rate in all species1 4 2 3 5yearsThe molecular clock results in ultrametric distancesUltrametric Distances & Average LinkageAverage Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distancesProof: Exercise (extra credit)1 4 2 35Weakness of Average LinkageMolecular clock: all species evolve at the same rate (Earth time)However, certain species (e.g., mouse, rat) evolve much fasterExample where UPGMA messes up:23411432Correct treeAL treeAdditive DistancesGiven a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting themGiven a tree T & additive distances dij, can uniquely reconstruct edge lengths:•Find two neighboring leaves i, j, with common parent k•Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m12345678910121113d1,4Additive DistancesFor any four leaves x, y, z, w, consider the three sums d(x, y) + d(z, w)d(x, z) + d(y, w)d(x, w) + d(y, z)One of them is smaller than the other two, which are equald(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)xyzwReconstructing Additive Distances Given Txyzwv5473346v w x y zv0 10 17 16 16w0 15 14 14x0 9 15y0 14z0TIf we know T and D, but do not know the length of each leaf, we can reconstruct those lengthsDReconstructing Additive Distances Given Txyzwvv w x y zv0 10 17 16 16w0 15 14 14x0 9 15y0 14z0TDReconstructing Additive Distances Given Txyzwvv w x y zv0 10 17 16 16w0 15 14 14x0 9 15y0 14z0TDa x y za0 11 10 10x0 9 15y0 14z0aD1dax = ½ (dvx + dwx – dvw)day = ½ (dvy + dwy – dvw)daz = ½ (dvz + dwz – dvw)Reconstructing Additive Distances Given TxyzwvTa x y za0 11 10 10x0 9 15y0 14z0aD1a b za0 6 10b0 10z0D2bca ca0 3c0D3d(a, c) = 3d(b, c) =
View Full Document