Fast phylogenetic DNA barcoding

Kasper Munch1,*, Wouter Boomsma2, Eske Willerslev3 and Rasmus Nielsen4,5

1 Department of Integrative Biology, and 5 Departments of Integrative Biology and Statistics, University of California, Berkeley, CA 94720-3140, USA
2 Bioinformatics Centre, University of Copenhagen, Ole Maaløes Vej 5, 2200 København N, Denmark
3 Department of Biology and Centre for Ancient Genetics, and 4 Department of Biology, University of Copenhagen, Universitetsparken 15, 2100 København Ø, Denmark

This work was funded by the Lundbeck Foundation.

We present a heuristic approach to the DNA assignment problem based on phylogenetic inferences using constrained neighbour joining and non-parametric bootstrapping. We show that this method performs as well as the more computationally intensive full Bayesian approach in an analysis of 500 insect DNA sequences obtained from GenBank. We also analyse a previously published dataset of environmental DNA sequences from soil from New Zealand and Siberia, and use these data to illustrate the fact that statistical approaches to the DNA assignment problem allow for more appropriate criteria for determining the taxonomic level at which a particular DNA sequence can be assigned.

Keywords: assignment; barcoding; phylogenetics; neighbour joining

1. INTRODUCTION

DNA barcoding is the use of DNA sequences for identifying unknown biological specimens. A DNA sequence is obtained for a particular marker, typically cytochrome oxidase I in animals, and this sequence is compared to a DNA database to determine to which species or other taxonomic unit the specimen belongs. DNA barcoding is, in one form or another, widely used in conservation genetics and molecular ecology (e.g. Duminil et al. 2006; Rubinoff 2006; Ward et al. 2008), but is also used in a number of other areas including forensic applications (e.g.
Dawnay et al. 2007) and ancient DNA studies (e.g. Willerslev et al. 2007). It has often been associated with methods for delineating and defining species based on DNA evidence (e.g. Floyd et al. 2002; Hebert et al. 2003; Remigio & Hebert 2003; Moritz & Cicero 2004). However, in this paper, we will solely consider the statistical question of how to assign DNA sequences to a priori defined taxonomical units.

This fundamental statistical problem has been addressed in a number of studies (e.g. Matz & Nielsen 2005; Meyer & Paulay 2005; Steinke et al. 2005; Nielsen & Matz 2006; Abdo & Golding 2007; Munch et al. 2008). We recently proposed a Bayesian approach based on a combination of automated database searches, alignment and Bayesian phylogenetic inference (Munch et al. 2008). The objective of this approach is to approximate the posterior probability that the unknown specimen belongs to a specific species or taxonomic group. This is done by first obtaining a number of sequences with high homology to the unknown specimen using database searches, aligning these sequences to each other and the unknown specimen, and then determining the posterior probability of membership of a particular group using a Markov chain Monte Carlo (MCMC) approach similar to the one commonly used in phylogenetic inference (e.g. Yang & Rannala 1997; Huelsenbeck & Ronquist 2001). Under the assumption that the sequences in the alignment include all relevant species, the MCMC output can be directly processed to give the desired probabilities of taxon membership. This method was implemented in a computer program, ‘Statistical Assignment Package’ (SAP; Munch et al. 2008), and was used in several applications, including the analysis of hundreds of ancient DNA sequences from ice cores from the Greenlandic ice (Johnson et al. 2007; Willerslev et al. 2007).

While the method in SAP was found to have good statistical performance on real and simulated datasets (Munch et al.
2008), it may not be easily applicable to large-scale datasets, such as the datasets produced in metagenomics applications. In such applications, thousands or hundreds of thousands of sequences are being analysed, rendering MCMC-based approaches computationally intractable. In this paper, therefore, we explore the possibility of using the neighbour-joining algorithm (Saitou & Nei 1987) in combination with bootstrapping (Felsenstein 1985) as a heuristic approach to approximate the posterior probabilities. An alternative approach is to interpret bootstrap proportions in a frequentist framework to make assignments based on hypothesis testing (e.g. Nielsen & Matz 2006). The Bayesian interpretation of bootstrap proportions used here has the advantage that it allows for the possibility of using decision theory to devise criteria for assignment (Abdo & Golding 2007). We will show that while there often are large differences between posterior probabilities and bootstrap proportions, the neighbour joining with bootstrap approach nonetheless performs quite well as a method for DNA barcoding inference. As with all other approaches, the inferences are only as good as the database used. The method does not model species not represented in the database, and can lead to wrong inferences if the database is not representative.

Phil. Trans. R. Soc. B (2008) 363, 3997–4002. doi:10.1098/rstb.2008.0169. Published online 7 October 2008. One contribution of 17 to a Discussion Meeting Issue ‘Statistical and computational challenges in molecular phylogenetics and evolution’. * Author for correspondence ([email protected]). This journal is © 2008 The Royal Society.

2. MATERIAL AND METHODS

SAP implements automatic assignment of sample sequences to taxa based on the position of the sample sequence in the phylogeny of life. In the first presentation of SAP (Munch et al.
2008) a Bayesian approach was taken, using MCMC to estimate the posterior probabilities that the sample sequence forms a monophyletic group together with a particular monophyletic clade.

Ideally, all homologues available in the database should be included in such analyses. However, due to the computational complexity of running the MCMC analysis, a heuristic is instead applied to compile a representative set of sequence homologues. We use BLAST searches against GenBank to identify homologues and retrieve sequences and taxonomic annotation for each one, disregarding homologues with insufficient annotation. By including only homologues with a BLAST score of at least half that of the best matching homologue, we exclude
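
The homologue selection rule described above (drop hits with insufficient annotation, then keep only hits scoring at least half of the best match) can be illustrated with a minimal Python sketch. This is not the SAP implementation: the `Hit` record, its field names, the `select_homologues` function and the example accessions are all hypothetical, and the sketch assumes that the best match is taken among annotated hits only, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    """One database hit. All field names are hypothetical, for illustration."""
    accession: str
    score: float              # BLAST score of the hit against the sample
    taxonomy: Optional[str]   # None stands in for "insufficient annotation"


def select_homologues(hits: List[Hit], min_fraction: float = 0.5) -> List[Hit]:
    """Keep annotated hits scoring at least min_fraction of the best score.

    Assumption: the best score is computed after unannotated hits are dropped.
    """
    annotated = [h for h in hits if h.taxonomy]
    if not annotated:
        return []
    best = max(h.score for h in annotated)
    return [h for h in annotated if h.score >= min_fraction * best]


# Made-up hits: D is unannotated, C falls just below half the best score.
example_hits = [
    Hit("A", 200.0, "Insecta; Diptera"),
    Hit("B", 100.0, "Insecta; Coleoptera"),
    Hit("C", 99.0, "Insecta; Lepidoptera"),
    Hit("D", 300.0, None),
]
print([h.accession for h in select_homologues(example_hits)])  # ['A', 'B']
```

A relative threshold of this kind scales with the quality of the best match, so a sample with only distant relatives in the database still yields a non-empty homologue set, whereas a fixed absolute cut-off might return nothing.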

