Phylogeny Tree ReconstructionInferring PhylogeniesNeighbor-JoiningAlgorithm: Neighbor-joiningParsimony – What if we don’t have distancesExample: Parsimony cost of one columnParsimony ScoringExampleTraceback to find ancestral nucleotidesSlide 10Probabilistic MethodsSlide 12Felsenstein’s Likelihood AlgorithmSlide 14Current popular methodsMultiple Sequence AlignmentsProtein PhylogeniesProtein Phylogenies – ExampleSlide 19Slide 20DefinitionScoring Function: Sum Of PairsSum Of Pairs (cont’d)A Profile RepresentationSlide 25Multidimensional DPSlide 27Slide 28Slide 29Progressive AlignmentSlide 31Slide 32Heuristics to improve alignmentsIterative RefinementSlide 35Slide 36Slide 37A* for Multiple AlignmentsSlide 39Slide 40ConsistencySlide 42Some ResourcesPhylogeny Tree Reconstruction1432514235CS262 Lecture 13, Win06, BatzoglouInferring PhylogeniesTrees can be inferred by several criteria:Morphology of the organisms•Can lead to mistakes!Sequence comparisonExample:Orc: ACAGTGACGCCCCAAACGTElf: ACAGTGACGCTACAAACGTDwarf: CCTGTGACGTAACAAACGAHobbit: CCTGTGACGTAGCAAACGAHuman: CCTGTGACGTAGCAAACGACS262 Lecture 13, Win06, BatzoglouNeighbor-Joining•Guaranteed to produce the correct tree if distance is additive•May produce a good tree even when distance is not additiveStep 1: Finding neighboring leavesDefineDij = dij – (ri + rj)Where 1 ri = –––––k dik |L| - 2Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighborsProof: Very technical, please read Durbin et al.!Four-Point Condition: If and only if for every set of four leaves I, j, k, l, two of the distances dij + dkl, dik + djl, dil + djkare equal and larger than the third, the distances are additive and we can reconstruct the tree12430.10.1 0.10.40.4CS262 Lecture 13, Win06, BatzoglouAlgorithm: Neighbor-joiningInitialization:Define T to be the set of leaf nodes, one per sequenceLet L = TIteration:Pick i, j s.t. Dij is minimalDefine a new node k, and set dkm = ½ (dim + djm – dij) for all m LAdd k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dikRemove i, j from L; Add k to LTermination:When L consists of two nodes, i, j, and the edge between them of length dijCS262 Lecture 13, Win06, BatzoglouParsimony – What if we don’t have distances•One of the most popular methods:GIVEN multiple alignmentFIND tree & history of substitutions explaining alignmentIdea: Find the tree that explains the observed sequences with a minimal number of substitutionsTwo computational subproblems:1. Find the parsimony cost of a given tree (easy)2. Search through all tree topologies (hard)CS262 Lecture 13, Win06, BatzoglouExample: Parsimony cost of one columnABAA{A, B}CostC+=1{A}Final cost C = 1{A}{A}{B}{A}{A}ABAACS262 Lecture 13, Win06, BatzoglouParsimony Scoring Given a tree, and an alignment column uLabel internal nodes to minimize the number of required substitutionsInitialization:Set cost C = 0; node k = 2N – 1 (last leaf)Iteration:If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth speciesIf k is not a leaf,Let i, j be the daughter nodes;Set Rk = Ri Rj if intersection is nonemptySet Rk = Ri Rj, and C += 1, if intersection is emptyTermination:Minimal cost of tree for column u, = CCS262 Lecture 13, Win06, BatzoglouExampleA A A B{A} {A} {A} {B}B A BA{A} {B} {A} {B}{A}{A}{A}{A,B}{A,B}{B}{B}CS262 Lecture 13, Win06, BatzoglouTraceback:1. Choose an arbitrary nucleotide from R2N – 1 for the root 2. Having chosen nucleotide r for parent k, If r Ri choose r for daughter iElse, choose arbitrary nucleotide from RiEasy to see that this traceback produces some assignment of cost CTraceback to find ancestral nucleotidesCS262 Lecture 13, Win06, BatzoglouExampleABAB{A, B}{A, B}{A}{A}{B}{A}{B}ABABAA AxxABABAB AxxABABBB BxxAdmissible with TracebackStill optimal, but inadmissible with TracebackCS262 Lecture 13, Win06, BatzoglouProbabilistic MethodsA more refined measure of evolution along a tree than parsimonyP(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1,= pA¼(1 + 3e-4α) ¼(1 – e-4α) = (¼)3(1 + 3e-4α)(1 – e-4α) x1t2xroott1x2CS262 Lecture 13, Win06, BatzoglouProbabilistic Methods•If we know all internal labels xu, P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot)jrootP(xj | xparent(j), tj, parent(j))•Usually we don’t know the internal labels, thereforeP(x1, x2, …, xN | T, t) = xN+1 xN+2 … x2N-1 P(x1, x2, …, x2N-1 | T, t) xrootx1x2xNxuCS262 Lecture 13, Win06, BatzoglouFelsenstein’s Likelihood AlgorithmTo calculate P(x1, x2, …, xN | T, t)Initialization:Set k = 2N – 1Iteration: Compute P(Lk | a) for all a If k is a leaf node:Set P(Lk | a) = 1(a = xk)If k is not a leaf node:1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j2. Set P(Lk | a) = b,c P(b | a, ti) P(Li | b) P(c | a, tj) P(Lj | c)Termination:Likelihood at this column = P(x1, x2, …, xN | T, t) = aP(L2N-1 | a)P(a)Let P(Lk | a) denote the prob. of all the leaves below node k, given that the residue at k is aCS262 Lecture 13, Win06, BatzoglouProbabilistic MethodsGiven M (ungapped) alignment columns of N sequences, •Define likelihood of a tree:L(T, t) = P(Data | T, t) = m=1…M P(x1m, …, xnm, T, t)Maximum Likelihood Reconstruction:•Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)CS262 Lecture 13, Win06, BatzoglouCurrent popular methodsHUNDREDS of programs available!http://evolution.genetics.washington.edu/phylip/software.html#methodsSome recommended programs:•Discrete—Parsimony-basedRec-1-DCM3 http://www.cs.utexas.edu/users/tandy/mp.htmlTandy Warnow and colleagues•ProbabilisticSEMPHYhttp://www.cs.huji.ac.il/labs/compbio/semphy/Nir Friedman and colleaguesCS262 Lecture 13, Win06, BatzoglouMultiple Sequence Multiple Sequence AlignmentsAlignmentsCS262 Lecture 13, Win06, BatzoglouProtein Phylogenies•Proteins evolve by both duplication and species divergenceCS262 Lecture 13, Win06, BatzoglouProtein Phylogenies – ExampleCS262 Lecture 13, Win06, BatzoglouCS262 Lecture 13, Win06, BatzoglouCS262 Lecture 13, Win06, BatzoglouDefinition•Given N sequences x1, x2,…, xN:Insert gaps (-) in each sequence xi, such that•All sequences have the same length L•Score of the global map is maximum•A faint similarity between two sequences becomes
View Full Document