Phylogeny Tree ReconstructionPhylogenetic TreesParsimony – direct method not using distancesExample: Parsimony cost of one columnParsimony ScoringExampleTraceback to find ancestral nucleotidesSlide 8Multiple Sequence AlignmentsDefinitionScoring Function: Sum Of PairsSum Of Pairs (cont’d)A Profile RepresentationSlide 14Multidimensional DPSlide 16Slide 17Slide 18Progressive AlignmentSlide 20Slide 21Heuristics to improve alignmentsIterative RefinementSlide 24Slide 25Slide 26ConsistencySlide 28Some ResourcesPhylogeny Tree Reconstruction1432514235CS262 Lecture 13, Win07, BatzoglouPhylogenetic Trees•Nodes: species•Edges: time of independent evolution•Edge length represents evolution timeAKA genetic distanceNot necessarily chronological timeCS262 Lecture 13, Win07, BatzoglouParsimony – direct method not using distances•One of the most popular methods:GIVEN multiple alignmentFIND tree & history of substitutions explaining alignmentIdea: Find the tree that explains the observed sequences with a minimal number of substitutionsTwo computational subproblems:1. Find the parsimony cost of a given tree (easy)2. Search through all tree topologies (hard)CS262 Lecture 13, Win07, BatzoglouExample: Parsimony cost of one columnABAA{A, B}CostC+=1{A}Final cost C = 1{A}{A}{B}{A}{A}ABAACS262 Lecture 13, Win07, BatzoglouParsimony Scoring Given a tree, and an alignment column uLabel internal nodes to minimize the number of required substitutionsInitialization:Set cost C = 0; node k = 2N – 1 (last leaf)Iteration:If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth speciesIf k is not a leaf,Let i, j be the daughter nodes;Set Rk = Ri Rj if intersection is nonemptySet Rk = Ri Rj, and C += 1, if intersection is emptyTermination:Minimal cost of tree for column u, = CCS262 Lecture 13, Win07, BatzoglouExampleA A A B{A} {A} {A} {B}B A BA{A} {B} {A} {B}{A}{A}{A}{A,B}{A,B}{B}{B}CS262 Lecture 13, Win07, BatzoglouTraceback:1. Choose an arbitrary nucleotide from R2N – 1 for the root 2. Having chosen nucleotide r for parent k, If r Ri choose r for daughter iElse, choose arbitrary nucleotide from RiEasy to see that this traceback produces some assignment of cost CTraceback to find ancestral nucleotidesCS262 Lecture 13, Win07, BatzoglouExampleABAB{A, B}{A, B}{A}{A}{B}{A}{B}ABABAA AxxABABAB AxxABABBB BxxAdmissible with TracebackStill optimal, but inadmissible with TracebackCS262 Lecture 13, Win07, BatzoglouMultiple Sequence Multiple Sequence AlignmentsAlignmentsCS262 Lecture 13, Win07, BatzoglouDefinition•Given N sequences x1, x2,…, xN:Insert gaps (-) in each sequence xi, such that•All sequences have the same length L•Score of the global map is maximum•A faint similarity between two sequences becomes significant if present in many•Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology•The patterns of conservation can help us tell function of the elementCS262 Lecture 13, Win07, BatzoglouScoring Function: Sum Of Pairs Definition: Induced pairwise alignmentA pairwise alignment induced by the multiple alignmentExample: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAGInduces:x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAGCS262 Lecture 13, Win07, BatzoglouSum Of Pairs (cont’d)•Heuristic way to incorporate evolution tree:HumanMouseChicken•Weighted SOP:S(m) = k<l wkl s(mk, ml)DuckCS262 Lecture 13, Win07, BatzoglouA Profile Representation•Given a multiple alignment M = m1…mnReplace each column mi with profile entry pi•Frequency of each letter in •# gaps•Optional: # gap openings, extensions, closingsCan think of this as a “likelihood” of each letter in each position - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4CS262 Lecture 13, Win07, BatzoglouMultiple Sequence AlignmentsAlgorithmsCS262 Lecture 13, Win07, BatzoglouMultidimensional DPGeneralization of Needleman-Wunsh:S(m) = i S(mi)(sum of column scores)F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN) F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr))CS262 Lecture 13, Win07, Batzoglou•Example: in 3D (three sequences):•7 neighbors/cellF(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(xi, xj, xk),F(i – 1, j – 1, k ) + S(xi, xj, - ),F(i – 1, j , k – 1) + S(xi, -, xk),F(i – 1, j , k ) + S(xi, -, - ),F(i , j – 1, k – 1) + S( -, xj, xk),F(i , j – 1, k ) + S( -, xj, - ),F(i , j , k – 1) + S( -, -, xk) } Multidimensional DPCS262 Lecture 13, Win07, BatzoglouRunning Time:1. Size of matrix: LN;Where L = length of each sequence N = number of sequences2. Neighbors/cell: 2N – 1Therefore………………………… O(2N LN)Multidimensional DPCS262 Lecture 13, Win07, BatzoglouRunning Time:1. Size of matrix: LN;Where L = length of each sequence N = number of sequences2. Neighbors/cell: 2N – 1Therefore………………………… O(2N LN)Multidimensional DP•How do gap states generalize?•VERY badly!Require 2N – 1 states, one per combination of gapped/ungapped sequencesRunning time: O(2N 2N LN) = O(4N LN)XY XYZ ZY YZX XZCS262 Lecture 13, Win07, BatzoglouProgressive Alignment•When evolutionary tree is known:Align closest first, in the order of the treeIn each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presultWeighted version:Tree edges have weights, proportional to the divergence in that edgeNew profile is a weighted average of two old profilesxwyzpxypzwpxyzwCS262 Lecture 13, Win07, BatzoglouProgressive Alignment•When evolutionary tree is known:Align closest first, in the order of the treeIn each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presultWeighted version:Tree edges have weights, proportional to the divergence in that edgeNew profile is a weighted average of two old profilesxwyzExampleProfile: (A, C, G, T, -)px = (0.8, 0.2, 0, 0, 0)py = (0.6, 0, 0, 0,
View Full Document