Solving Phylogenetic TreesTable of ContentsPhylogenyDNA Sequence EvolutionProblem DefinitionSo what….Why is that a problem?What do we want?Preparing the inputDistance MatrixSlide 11Lets make it simple (constrain the input)ERROR?!?!!?Data SetsMore Data SetsData UseAspects of TreesWhat can distance tell us?Current TechniquesMore TechniquesLearning Neighbor JoiningNJ Part 1NJ Part 2NJ Part 3Are we done?Here’s what we wantA DCM* - NJ SolutionPhase 1Finding tqWhat does that meanFinding tq (terms)ThresholdSlide 33With q = D15 = 67TriangulatingSlide 36Maximal CliquesTriangulated Threshold GraphCliqueCreate Trees for the CliquesTree {A, B, E} and {C,D}Merge your separate trees together.That sounds like NP-hard!Where are we now?Phase 2SQS - A GuideSQS - RefrasedPowerPoint PresentationSQS MethodHow do we know we’re right?Performance of DCM * - NJImprovementsPerformance gainsComparing ImprovementsBenjamin Loyle 2004 Cse 397Solving Phylogenetic TreesBenjamin LoyleMarch 16, 2004Cse 397 : Intro to MBIOBenjamin Loyle 2004 Cse 397Table of ContentsProblem & Term DefinitionsA DCM*-NJ SolutionPerformance MeasurementsPossible ImprovementsBenjamin Loyle 2004 Cse 397From the Tree of the Life Website,University of ArizonaOrangutanGorilla ChimpanzeeHumanPhylogenyBenjamin Loyle 2004 Cse 397-3 mil yrs-2 mil yrs-1 mil yrstodayAAGACTTTGGACTTAAGGCCTAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTTAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCATAGGGCAT TAGCCCT AGCACTTDNA Sequence EvolutionBenjamin Loyle 2004 Cse 397Problem DefinitionThe Tree of LifeConnecting all living organismsAll encompassingFind evolution from simple beginningsEven smaller relations are toughImpossibleInfer possible ancestral history.Benjamin Loyle 2004 Cse 397So what….Genome sequencing provides entire map of a species, why link them?We can understand evolutionViable drug testing and designPredict the function of genesInfluenza evolutionBenjamin Loyle 2004 Cse 397Why is that a problem?Over 8 million organismsCurrent solutions are NP-hardComputing a few hundred species takes yearsError is a very large factorBenjamin Loyle 2004 Cse 397What do we want?InputA collection of nodes such as taxa or protein strings to compare in a treeOutputA topological link to compare those nodes to each otherWhen do we want it?FAST!Benjamin Loyle 2004 Cse 397Preparing the inputCreate a distance matrixSum up all of the known distances into a matrix sized n x nN is the number of nodes or taxaFound with sequence comparisonBenjamin Loyle 2004 Cse 397Distance MatrixTake 5 separate DNA stringsA : GATCCATGA B : GATCTATGCC : GTCCCATTTD : AATCCGATCE : TCTCGATAGThe distance between A and B is 2 The distance between A and C is 4This is subjective based on what your criteria are.Benjamin Loyle 2004 Cse 397Distance MatrixLets start with an example matrix0 63 94 111 670 79 96 160 47 830 1000ABCDEA B C D EBenjamin Loyle 2004 Cse 397Lets make it simple (constrain the input)Lets keep the distance between nodes within a certain limitFrom F -> GF and G have the largest distance; they are the most dissimilar of any nodes.This is called the diameter of the treeLets keep the length of the input (length of the strings) polynomial.Benjamin Loyle 2004 Cse 397ERROR?!?!!?All trees are inferred, how do you ever know if you’re right?How accurate do we have to be?We can create data sets to test trees that we create and assume that it will then work in the real worldBenjamin Loyle 2004 Cse 397Data SetsJC ModelSites evolve independentSites change with the same probabilityChanges are single character changes•Ie. A -> G or T -> CThe expectation of change is a Poisson variable (e)Benjamin Loyle 2004 Cse 397More Data SetsK2P ModelBased on JC ModelAllows for probability of transitions to tranversions•It’s more likely for A and T to switch and G and C to switch•Normally set to twice as likelyBenjamin Loyle 2004 Cse 397Data UseUsing these data sets we can create our own evolution of data.Start with one “ancestor” and create evolutions Plug the evolutions back and see if you get what you started withBenjamin Loyle 2004 Cse 397Aspects of TreesTopology•The method in which nodes are connected to each other•“Are we really connected to apes directly, or just linked long before we could be considered mammals?”Distance•The sum of the weighted edges to reach one node from anotherBenjamin Loyle 2004 Cse 397What can distance tell us?The distance between nodes IS the evolutionary distance between the nodesThe distance between an ancestor and a leaf(present day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.Benjamin Loyle 2004 Cse 397Current TechniquesMaximum ParsimonyMinimize the total number of evolutionary eventsFind the tree that has a minimum amount of changes from ancestorsMaximum LikelihoodProbability basedWhich tree is most probable to occur based on current dataBenjamin Loyle 2004 Cse 397More TechniquesNeighbor JoiningRepeatedly joins pairs of leaves (or subtrees) by rules of numerical optimizationIt shrinks the distance matrix by considering two ‘neighbors’ as one nodeBenjamin Loyle 2004 Cse 397Learning Neighbor JoiningIt will become apparent later on, but lets learn how to do Neighbor Joining (NJ)0 3 3 4 30 3 3 40 3 30 30ABCDEA B C D EBenjamin Loyle 2004 Cse 397NJ Part 1First start with a “star tree”AB CDEBenjamin Loyle 2004 Cse 397NJ Part 2Combine the closest two nodes (from distance matrix)•In our case it is node A and B at distance 3AB CDEBenjamin Loyle 2004 Cse 397NJ Part 3Repeat this until you have added n-2 nodes (3)•N-2 will make it a binary tree, so we only have to include one more node.AB CDEBenjamin Loyle 2004 Cse 397Are we done?ML and MP, even in heuristic form take too long for large data setsNJ has poor topological accuracy, especially for large diameter treesWe need something that works for large diameter trees and can be run fast.Benjamin Loyle 2004 Cse 397Here’s what we wantOur Goal An “Absolute Fast Converging” Method is afc if, for all positive f,g, €, on the Model M, there is a polynomial p such that, for all (T,{(e)}) is in the set Mf,g on a set S of n sequences of length at least p(n) generated on T, we have Pr[(S) = T] > 1- €.•Simply: Lets make it in polynomial time
View Full Document