UCF CAP 5937 - Computational methods in phylogenetic analysis

Unformatted text preview:

1Computational methods inphylogenetic analysisTutorial at CSB 2004Tandy WarnowReconstructing the “Tree” of LifeHandling large datasets: Handling large datasets: millions of speciesmillions of species2Phylogenetic Inference• Hard optimization problems (e.g. MP, ML)– Better heuristics– Better approximations/lower bounds Relationship between quality of optimization criterion and topological accuracyPhylogenetic Inference, cont.• Bayesian inference • Whole Genome Rearrangements • Reticulate evolution • Processing sets of trees: compact representations and consensus methods • Supertree methods • Statistical issues with respect to stochastic models of evolution (e.g., “fast converging methods”) • Multiple sequence alignment3Major challenge: MP and ML• Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists• The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasetsOutline• Part I (Basics): 40 minutes• Part II (Models of evolution): 20 min.• Part III (Distance-based methods): 30 min.• Part IV (Maximum Parsimony): 30 min.• Part V (Maximum Likelihood): 15 minutes• Part VI (Open problems/research directions): 30 minutes4Part I: Basics (40 minutes)Questions: • What is a phylogeny? • What data are used?• What are the most popular methods?• What is meant by “accuracy”, and how is it measured?• What is involved in a phylogenetic analysis?PhylogenyOrangutanGorilla ChimpanzeeHumanFrom the Tree of the Life Website,University of Arizona5Data• Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment• Molecular markers (e.g., SNPs, RFLPs, etc.)• Morphology• Gene order and contentThese are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character DNA Sequence EvolutionAAGACTTTGGACTTAAGGCCT-3 mil yrs-2 mil yrs-1 mil yrstodayAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTTAGCCCA TAGACTTAGCGCTTAGCACAAAGGGCATAGGGCAT TAGCCCT AGCACTTAAGACTTTGGACTTAAGGCCTAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTAGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT6Phylogeny ProblemTAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCATUVWXYUVWXYPhylogenetic Analyses• Step 1: Gather sequence data, and estimate the multiple alignment of the sequences.• Step 2: Reconstruct trees on the data. (This can result in many trees.)• Step 3: Apply consensus methods to the set of trees to figure out what is reliable.7Reconstruction methods• Much software exists, most of which attempt to solve one of two major optimization criteria: Maximum Parsimony and Maximum Likelihood. The most frequently used software package is PAUP*, which contains many different heuristics.• Methods for phylogeny reconstruction are evaluated primarily in simulation studies, based upon stochastic models of evolution.Consensus and agreement methods• Consensus methods take a set of trees on the same set of taxa, and return a single tree on the full set. Standard approaches: strict consensus and majority tree.• Agreement methods take a set of trees on the same set of taxa, and return a single tree on a subset of the taxa. Standard approaches: maximum agreement subtree.• Much new research needs to be done8The Jukes-Cantor model of site evolution• Each “site” is a position in a sequence• The state (i.e., nucleotide) of each site at the root is random• The sites evolve independently and identically (i.i.d.)• If the site changes its state on an edge, it changes with equal probability to the other states• For every edge e, p(e) is defined, which is the probability of change for a random site on the edge e.Methods for phylogenetic inference• Polynomial time methods, mostly based upon estimating evolutionary distances between sequences, and then using them to construct a tree with edge lengths• Heuristics for hard optimization problems (such as maximum parsimony and maximum likelihood)• Bayesian MCMC methods9Additive Distance MatricesDistance-based Phylogenetic Methods10Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)• Input: Set S of n aligned sequences of length k• Output: A phylogenetic tree T– leaf-labeled by sequences in S– additional sequences of length k labeling the internal nodes of Tsuch that is minimized. ∑∈ )(),(),(TEjijiHMaximum parsimony (example)• Input: Four sequences–ACT–ACA–GTT–GTA• Question: which of the three trees has the best MP scores?11Maximum ParsimonyACTGTT ACAGTAACAACTGTAGTTACTACAGTTGTAMaximum ParsimonyACTGTTGTT GTAACAGTA122MP score = 5ACAACTGTAGTTACA ACT313MP score = 7ACTACAGTTGTAACA GTA121MP score = 4Optimal MP tree12Maximum Parsimony: computational complexityACTACAGTTGTAACA GTA121MP score = 4Finding the optimal MP tree is NP-hardOptimal labeling can becomputed in linear time O(nk)Maximum Likelihood (ML)• Given: stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences • Objective: Find tree T and probabilities p(e) of substitution on each edge, to maximize the probability of the data.Preferred by some systematists, but even harder than MP in practice.13Bayesian MCMC• Assumes a model of evolution (e.g., Jukes-Cantor)• The basic algorithmic approach is a random walk through the space of model trees, with the probability of the data on the model tree determining whether the proposed new model tree is accepted or rejected. • Statistics on the set of trees visited after “burn-in” constitute the output.Performance criteria for phylogeny reconstruction methods• Speed• Space• Optimality criterion accuracy• “Topological accuracy” (specifically statistical consistency, convergence rate, and performance on finite data)These criteria can be evaluated on real or simulated data.14Evaluating MP heuristics with respect to MP scoresTimeMP scoreof best treesPerformance of Heuristic 1Performance of Heuristic 2Fake studyQuantifying Topological ErrorFN: false negative(missing edge)FP: false positive(incorrect edge)50% error rateFNFP15Statistical performance issues• Statistical consistency: an estimation method is statistically consistent under a model if the probability that the method returns the true tree goes to 1 as the sequence length goes to infinity• Convergence rate: the amount of data that a method needs to return the true tree


View Full Document

UCF CAP 5937 - Computational methods in phylogenetic analysis

Documents in this Course
Load more
Download Computational methods in phylogenetic analysis
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Computational methods in phylogenetic analysis and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Computational methods in phylogenetic analysis 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?