UCF CAP 5937 - Challenges in constructing very large evolutionary trees - D3047630

Home> Schools> University of Central Florida> Computer Applications (CAP) > CAP 5937> Challenges in constructing very large evolutionary trees

UCF CAP 5937 - Challenges in constructing very large evolutionary trees

School name University of Central Florida

Course Cap 5937- Medical Image Computing

Pages 24

Download Save

Unformatted text preview:

1Challenges in constructing very large evolutionary treesTandy WarnowRadcliffe Institute for Advanced StudyUniversity of Texas at AustinPhylogenyFrom the Tree of the Life Website,University of ArizonaOrangutanGorilla ChimpanzeeHuman2Ringe-Warnow Phylogenetic Tree of Indo-European Major methods for phylogeny reconstruction • Biology: Polynomial time methods (good enough for small datasets), and local search heuristics for NP-hard optimization problems • Linguistics: exact algorithms for NP-hardoptimization problems3Evolution informs about everything in biology• Big genome sequencing projects just produce data -- so what?• Evolutionary history relates all organisms and genes, and helps us understand and predict – interactions between genes (genetic networks)– drug design– predicting functions of genes– influenza vaccine development– origins and spread of disease– origins and migrations of humansMain research foci• Solving maximum parsimony and maximum likelihood more effectively• “Fast converging methods”• Gene order and content phylogeny• Reticulate evolution• Phylogenetic multiple sequence alignment4Gene Order/Content Phylogeny• Group leader: Bernard Moret• Software: (1) simulating genome evolution on trees (2) GRAPPA: Genome Rearrangement Analysis using Parsimony and other Phylogenetic Algorithms• Currently limited to equal content genomes• Ongoing research: handling unequal gene contentReticulate Evolution5DNA Sequence EvolutionAAGACTTTGGACTTAAGGCCT-3 mil yrs-2 mil yrs-1 mil yrstodayAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTTAGCCCA TAGACTTAGCGCTTAGCACAAAGGGCATAGGGCAT TAGCCCT AGCACTTAAGACTTTGGACTTAAGGCCTAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTAGCGCTTAGCACAATAGACTTTAGCCCAAGGGCATMolecular SystematicsTAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCATUVWXYUVWXY6Basic challenges in molecularphylogenetics• Most favored approaches attempt to solve hard optimization problems such as maximum parsimony and maximum likelihood - can we design better methods?• DNA sequence evolution may be too “noisy” -perhaps we need new types of data?• Many equally good solutions for a given dataset -how can we figure out “truth”?• Not all evolution is tree-like - how can we detect and infer reticulate evolution?Some of our projects• Divide-and-conquer strategies for maximum parsimony and maximum likelihood• Using “rare genomic changes” for deep evolution• Consensus/clustering methods for sets of optimal trees• Detection and reconstruction of reticulate evolution(All projects are joint with biologists and computer scientists at various universities, and are part of the new ITR grant)7Coping with NP-hard problemsSince NP-hard problems may not be solvable in polynomial time, the options are:– Solve the problem exactly (but use lots of time on some inputs)– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)General comments for NP-hard optimization problems• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.• You may not know when you have an optimal solution, if you use a heuristic.• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. (But this may not be true for biology.)8DNA Sequence EvolutionAAGACTTTGGACTTAAGGCCT-3 mil yrs-2 mil yrs-1 mil yrstodayAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTTAGCCCA TAGACTTAGCGCTTAGCACAAAGGGCATAGGGCAT TAGCCCT AGCACTTAAGACTTTGGACTTAAGGCCTAGGGCAT TAGCCCT AGCACTTAAGGCCT TGGACTTAGCGCTTAGCACAATAGACTTTAGCCCAAGGGCATMajor phylogeny reconstruction methods• In biology: mostly hill-climbing heuristics that attempt to solve NP-hard optimization problems (maximum parsimony or maximum likelihood)• In historical linguistics: much less is established, but an exact solution to an NP-hard problem looks very promising.9Maximum ParsimonyACTGTT ACAGTAACAACTGTAGTTACTACAGTTGTAMaximum ParsimonyACTGTTGTT GTAACAGTA122MP score = 5ACAACTGTAGTTACA ACT313MP score = 7ACTACAGTTGTAACA GTA121MP score = 4Optimal MP tree10Maximum Parsimony: computational complexityACTACAGTTGTAACA GTA121MP score = 4Finding the optimal MP tree is NP-hardOptimal labeling can becomputed in linear time O(nk)Maximum Parsimony• Given a set S of strings of the same length over a fixed alphabet, find a tree T leaf-labelled by S and with all internal nodes labelled by strings of the same length over the same alphabet which minimizes the sum of the edge lengths.• Motivation: seeks to minimize the total number of point mutations needed to explain the data•NP-hard11Solving MP (maximum parsimony) and ML (maximum likelihood)Phylogenetic treesMP scoreGlobal optimumLocal optimum• Why are MP and ML hard? The search space is huge -- there are (2n-5)!! trees, it is easy to get stuck in local optima, and there can be many optimal trees.• Why try to solve MP or ML? Our experimental studies show that polynomial time algorithms don’t do as well as MP or ML when trees are big and have high rates of evolution.• Why solve MP and ML well? Because trees can change in biologically significant ways with small changes in objective criterion. (Open problem!)Using divide-and-conquer for MP and ML• Conjecture: better (more accurate) solutions will be found in less time, if we analyze a small number of smaller subsets and then combine solutions• Need: – 1. techniques for decomposing datasets, – 2. base methods for subproblems, and – 3. techniques for combining subtrees12Comparison between TBR and the Ratchet• Quite dramatic differences -- the Ratchet finds better trees than the best ways of running TBR branch-swapping, on all our datasets• Even the Ratchet can take too long on some datasets!Ochoterena dataset: 834 DNA sequencesThe DCM3 technique for speeding up MP/ML searches13Strict Consensus Merger (SCM)DCM3-boosting a base method 1. Decompose the dataset into smaller, overlapping subsets, using DCM32. Construct phylogenetic trees on the subsets using a base method3. Merge the subtrees into a single tree using the Strict Consensus Merger4. Use PAUP* constrained search to refine the resultant tree14What we found• I-DCM3-TBR is much faster than TBR on all the datasets we examined• I-DCM3-Ratchet is better than the Ratchet, but by less (depends on dataset)• I-DCM3-ML improves upon ML using PAUP* ML searches (by a huge amount)What we found• DCM3-TBR is much faster than TBR on all the datasets we

View Full Document


School:
Email:
New Password:
Confirm Password:

UCF CAP 5937 - Challenges in constructing very large evolutionary trees

Sign up for free to view:

Please select your school