LEHIGH CSE 397 - Solving Phylogenetic Trees

Unformatted text preview:

Solving Phylogenetic TreesTable of ContentsPhylogenyDNA Sequence EvolutionProblem DefinitionSo what….Why is that a problem?What do we want?Preparing the inputDistance MatrixSlide 11Lets make it simple (constrain the input)ERROR?!?!!?Data SetsMore Data SetsData UseAspects of TreesWhat can distance tell us?Current TechniquesMore TechniquesLearning Neighbor JoiningNJ Part 1NJ Part 2NJ Part 3Are we done?Here’s what we wantA DCM* - NJ SolutionPhase 1Finding tqWhat does that meanFinding tq (terms)ThresholdSlide 33With q = D15 = 67TriangulatingSlide 36Maximal CliquesTriangulated Threshold GraphCliqueCreate Trees for the CliquesTree {A, B, E} and {C,D}Merge your separate trees together.That sounds like NP-hard!Where are we now?Phase 2SQS - A GuideSQS - RefrasedPowerPoint PresentationSQS MethodHow do we know we’re right?Performance of DCM * - NJImprovementsPerformance gainsComparing ImprovementsBenjamin Loyle 2004 Cse 397Solving Phylogenetic TreesBenjamin LoyleMarch 16, 2004Cse 397 : Intro to MBIOBenjamin Loyle 2004 Cse 397Table of ContentsProblem & Term DefinitionsA DCM*-NJ SolutionPerformance MeasurementsPossible ImprovementsBenjamin Loyle 2004 Cse 397From the Tree of the Life Website,University of ArizonaOrangutanGorilla ChimpanzeeHumanPhylogenyBenjamin Loyle 2004 Cse 397-3 mil yrs-2 mil yrs-1 mil yrstodayAAGACTTTGGACTTAAGGCCTAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTTAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCATAGGGCAT TAGCCCT AGCACTTDNA Sequence EvolutionBenjamin Loyle 2004 Cse 397Problem DefinitionThe Tree of LifeConnecting all living organismsAll encompassingFind evolution from simple beginningsEven smaller relations are toughImpossibleInfer possible ancestral history.Benjamin Loyle 2004 Cse 397So what….Genome sequencing provides entire map of a species, why link them?We can understand evolutionViable drug testing and designPredict the function of genesInfluenza evolutionBenjamin Loyle 2004 Cse 397Why is that a problem?Over 8 million organismsCurrent solutions are NP-hardComputing a few hundred species takes yearsError is a very large factorBenjamin Loyle 2004 Cse 397What do we want?InputA collection of nodes such as taxa or protein strings to compare in a treeOutputA topological link to compare those nodes to each otherWhen do we want it?FAST!Benjamin Loyle 2004 Cse 397Preparing the inputCreate a distance matrixSum up all of the known distances into a matrix sized n x nN is the number of nodes or taxaFound with sequence comparisonBenjamin Loyle 2004 Cse 397Distance MatrixTake 5 separate DNA stringsA : GATCCATGA B : GATCTATGCC : GTCCCATTTD : AATCCGATCE : TCTCGATAGThe distance between A and B is 2 The distance between A and C is 4This is subjective based on what your criteria are.Benjamin Loyle 2004 Cse 397Distance MatrixLets start with an example matrix0 63 94 111 670 79 96 160 47 830 1000ABCDEA B C D EBenjamin Loyle 2004 Cse 397Lets make it simple (constrain the input)Lets keep the distance between nodes within a certain limitFrom F -> GF and G have the largest distance; they are the most dissimilar of any nodes.This is called the diameter of the treeLets keep the length of the input (length of the strings) polynomial.Benjamin Loyle 2004 Cse 397ERROR?!?!!?All trees are inferred, how do you ever know if you’re right?How accurate do we have to be?We can create data sets to test trees that we create and assume that it will then work in the real worldBenjamin Loyle 2004 Cse 397Data SetsJC ModelSites evolve independentSites change with the same probabilityChanges are single character changes•Ie. A -> G or T -> CThe expectation of change is a Poisson variable (e)Benjamin Loyle 2004 Cse 397More Data SetsK2P ModelBased on JC ModelAllows for probability of transitions to tranversions•It’s more likely for A and T to switch and G and C to switch•Normally set to twice as likelyBenjamin Loyle 2004 Cse 397Data UseUsing these data sets we can create our own evolution of data.Start with one “ancestor” and create evolutions Plug the evolutions back and see if you get what you started withBenjamin Loyle 2004 Cse 397Aspects of TreesTopology•The method in which nodes are connected to each other•“Are we really connected to apes directly, or just linked long before we could be considered mammals?”Distance•The sum of the weighted edges to reach one node from anotherBenjamin Loyle 2004 Cse 397What can distance tell us?The distance between nodes IS the evolutionary distance between the nodesThe distance between an ancestor and a leaf(present day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.Benjamin Loyle 2004 Cse 397Current TechniquesMaximum ParsimonyMinimize the total number of evolutionary eventsFind the tree that has a minimum amount of changes from ancestorsMaximum LikelihoodProbability basedWhich tree is most probable to occur based on current dataBenjamin Loyle 2004 Cse 397More TechniquesNeighbor JoiningRepeatedly joins pairs of leaves (or subtrees) by rules of numerical optimizationIt shrinks the distance matrix by considering two ‘neighbors’ as one nodeBenjamin Loyle 2004 Cse 397Learning Neighbor JoiningIt will become apparent later on, but lets learn how to do Neighbor Joining (NJ)0 3 3 4 30 3 3 40 3 30 30ABCDEA B C D EBenjamin Loyle 2004 Cse 397NJ Part 1First start with a “star tree”AB CDEBenjamin Loyle 2004 Cse 397NJ Part 2Combine the closest two nodes (from distance matrix)•In our case it is node A and B at distance 3AB CDEBenjamin Loyle 2004 Cse 397NJ Part 3Repeat this until you have added n-2 nodes (3)•N-2 will make it a binary tree, so we only have to include one more node.AB CDEBenjamin Loyle 2004 Cse 397Are we done?ML and MP, even in heuristic form take too long for large data setsNJ has poor topological accuracy, especially for large diameter treesWe need something that works for large diameter trees and can be run fast.Benjamin Loyle 2004 Cse 397Here’s what we wantOur Goal An “Absolute Fast Converging” Method is afc if, for all positive f,g, €, on the Model M, there is a polynomial p such that, for all (T,{(e)}) is in the set Mf,g on a set S of n sequences of length at least p(n) generated on T, we have Pr[(S) = T] > 1- €.•Simply: Lets make it in polynomial time


View Full Document

LEHIGH CSE 397 - Solving Phylogenetic Trees

Download Solving Phylogenetic Trees
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Solving Phylogenetic Trees and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Solving Phylogenetic Trees 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?