Assigning Sequences to Taxa
CMSC828G

Outline
- Objective (1 slide)
- MEGAN (17 slides)
- SAP (33 slides)
- Conclusion (1 slide)

Objective
Given an unknown, environmental DNA sequence:
- Make a taxonomic assignment by comparing the sample sequence to existing database sequences that have already been taxonomically labeled
** There is no attempt to characterize new species!

MEGAN - Metagenome Analyzer
- Huson et al. 2007
- Software that enables rapid analysis of large metagenomic data sets
- MEGAN 3 is the latest released version of the program
- Available for UNIX, Windows, and Mac OS X

MEGAN Processing Pipeline
- Reads are collected from a sample using any random shotgun sequencing protocol
- A sequence comparison of all reads against one or more sequence databases is performed
- MEGAN processes the results of the comparison and assigns each read to a taxon using the lowest common ancestor (LCA) algorithm

MEGAN Processing Pipeline

BLAST Options
- min-score: an alignment must achieve min-score to be included in the analysis
- top-percent: retain only those matches whose score is within top-percent of the highest score
- win-score: if a match scores above win-score, only consider other matches above win-score
- min-support: at least min-support reads must be assigned to a taxon for those assignments to count

LCA Algorithm

Data Analyses with MEGAN
- Sargasso Sea data set
- Mammoth data set
- Species identification from short reads
  - E. coli K12
  - B. bacteriovorus HD100

Sargasso Sea Data Set
- Venter et al. 2004
- Samples of seawater were collected, and organisms of size 0.1–3 µm were extracted and sequenced
- From four individual sampling sites, ∼1.66 million reads of average length 818 bp were recovered
- Biological diversity and abundance were measured using environmental assemblies, and by analyzing six phylogenetic markers (rRNA, RecA/RadA, HSP70, RpoB, EF-Tu, and EF-G)

Revealing "Microheterogeneity"

Distribution of Species Comparison

Mammoth Data Set
- Poinar et al. 2006
- 1 g bone sample taken from a mammoth that was preserved in permafrost for 28,000 years
- Obtained 302,692 reads of mean length 95 bp
- BLASTZ was used to determine reads that came from the mammoth genome, and BLASTX was used to characterize the remaining environmental diversity

Mammoth Data Set Summary
- Bit score threshold of 30, discarding any isolated assignments

Species Identification from Short Reads
- What is the minimum read length required to identify species in a metagenomic sample?
- Idea: simulate short reads from a known genome, and then evaluate accuracy of assignments
- Two organisms were chosen for this purpose: E. coli and B. bacteriovorus
- These two organisms were also randomly resequenced (and then subsequently analyzed)

E. coli Simulation Results
- Basically no false positives

E. coli Resequencing Results
- A few false positives

B. bacteriovorus Simulation Results
- Basically no false positives

B. bacteriovorus Resequencing Results

MEGAN, in Summary
- LCA algorithm is simple and conservative
- Does not make many false positive assignments, even when the unknown sample sequence does not exist in the database
- Species can be identified from short reads
- Most of the work has been in developing easy-to-use software with useful exploratory features and visualizations, many of which were not mentioned

Limitations of BLAST
- BLAST searches use local alignments, not global alignments, which leads to a loss of information
- BLAST searches do not consider the population genetic and phylogenetic issues associated with species identification
- The measures of confidence associated with BLAST searches (E-values) represent significance of local similarity, not significance of taxonomic assignment

SAP - Statistical Assignment Package
- Munch et al. 2008
- SAP is an automated method for DNA barcoding which includes database sequence retrieval, alignment, and phylogenetic analysis
- Most importantly, provides statistically meaningful measures of confidence
- Like MEGAN, does not attempt to identify new species

SAP - An Overview

Bayesian Approach
- Estimate the probability the sample sequence is part of a monophyletic group of database sequences
- X is the sample sequence, Ti is taxon i, and D is the set of database sequences representing k disjoint groups

Computing the Posterior Probability
- The posterior probability involves a summation over all possible phylogenetic trees, and for each tree, a multiple integral over all combinations of evolutionary model parameters
- Hence, the posterior probability cannot be computed analytically, even for small trees
- However, a method called Markov Chain Monte Carlo (MCMC) can be used to sample trees in proportion to their posterior probabilities

Sampling the Posterior Distribution

Finding Homologs
- Ideally, each sample sequence would be compared with all database sequences
- Instead, a heuristic is required to extract a limited representation of the database
- Thus, SAP uses BLAST to find database homologs

Finding Homologs, Method
- Include only matches whose BLAST score is at least half that of the best match (relative cutoff)
- Include only the best match from each species
- Include up to 30 species homologs, 10 genera, 6 families, 5 orders, 3 classes, and 2 phyla
- If the relative cutoff has been reached before 50 homologs have been included, allow other representatives from species already included

MSA and Phylogenetic Analysis
- The sample sequence and the set of homologs are aligned using ClustalW
- A program, likely some kind of MrBayes kernel, performs the Bayesian phylogenetic analysis
- All sequences except the sample sequence are topologically constrained to agree with the NCBI taxonomy
- 10,000 trees are sampled from the posterior distribution and analyzed to obtain probabilities of assignment to all taxa in the set of homologs

Taxonomic Assignment
- The probability of forming a monophyletic group with a given taxon is calculated as the fraction of sampled trees where the sister clade to the sample sequence is a member of that taxon.

Probabilities of Assignment

Computational Time
- Takes time to download sequences from GenBank
- Multiple alignment is fast, a couple of minutes
- The MCMC analysis is the bottleneck, averaging 1 hour
- Post-processing of MCMC output may take 10 minutes
(and this is for each sample sequence!)

Benchmark Analyses
- Cytochrome Oxidase I (COI) gene for the class Insecta
  - 10,804 sequences
- tRNA-Leu (trnL) gene for the class Liliopsida (monocots)
  - 640 sequences

Benchmarking Results

Comparison with BLAST

Reanalysis of Neanderthal Sequences
- In a number of studies, longer ancient DNA sequences were assembled from shorter reads
- However, what if some of
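
Sketch: LCA assignment with min-score and top-percent

The MEGAN slides describe filtering a read's database matches by min-score and top-percent and then assigning the read to the lowest common ancestor of the surviving matches. The following is a minimal sketch of that idea, not MEGAN's actual implementation. The taxonomy table `PARENT` is a hypothetical toy (ranks are skipped), the function names are made up for illustration, and the win-score and min-support options are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy taxonomy for illustration only: child -> parent.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Bdellovibrio": "Proteobacteria",
    "Bdellovibrio bacteriovorus": "Bdellovibrio",
}


def path_to_root(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the lineage of `taxon`, starting at the taxon and ending at the root."""
    path = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lowest_common_ancestor(taxa: List[str], parent: Dict[str, Optional[str]]) -> str:
    """Return the deepest taxon whose lineage contains every taxon in `taxa`."""
    paths = [path_to_root(t, parent) for t in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walk upward from the first taxon; the first shared node is the LCA.
    for node in paths[0]:
        if node in shared:
            return node
    raise ValueError("taxa do not share a common root")


def assign_read(
    hits: List[Tuple[str, float]],
    parent: Dict[str, Optional[str]],
    *,
    min_score: float,
    top_percent: float,
) -> Optional[str]:
    """Assign one read from its (taxon, score) hits; None if no hit passes min_score."""
    kept = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not kept:
        return None
    best = max(score for _, score in kept)
    cutoff = best * (1.0 - top_percent / 100.0)  # keep hits within top_percent of best
    taxa = [taxon for taxon, score in kept if score >= cutoff]
    return lowest_common_ancestor(taxa, parent)


if __name__ == "__main__":
    example_hits = [
        ("Escherichia coli", 100.0),
        ("Escherichia fergusonii", 95.0),
        ("Bdellovibrio bacteriovorus", 40.0),
    ]
    print(assign_read(example_hits, PARENT, min_score=35.0, top_percent=10.0))
```

With a narrow top-percent only the two Escherichia hits survive, so the read is placed at the genus. Widening top-percent lets the weaker Bdellovibrio hit in and pushes the assignment up to a higher taxon, which is the conservative behavior the summary slide refers to.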

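
Sketch: assignment probabilities from sampled trees

The Taxonomic Assignment slide defines the probability for a taxon as the fraction of sampled trees in which the sister clade of the sample sequence belongs to that taxon. The sketch below illustrates only that counting step and is not SAP's own code. It assumes a hypothetical input format: each sampled tree has already been reduced to the list of taxa (at any rank) that contain the whole sister clade. The taxon names in the example are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List


def assignment_probabilities(sister_lineages: Iterable[List[str]]) -> Dict[str, float]:
    """Estimate P(taxon) as the fraction of sampled trees whose sister clade lies in the taxon.

    `sister_lineages` holds one entry per sampled tree (hypothetical format):
    the taxa, at any rank, that contain the entire sister clade of the sample.
    """
    trees = list(sister_lineages)
    if not trees:
        raise ValueError("at least one sampled tree is required")
    counts: Counter = Counter()
    for lineage in trees:
        counts.update(set(lineage))  # count each taxon at most once per tree
    return {taxon: n / len(trees) for taxon, n in counts.items()}


if __name__ == "__main__":
    # Four made-up sampled trees; real runs would use thousands of MCMC samples.
    sampled = [
        ["Apis mellifera", "Apis", "Apidae"],
        ["Apis mellifera", "Apis", "Apidae"],
        ["Apis", "Apidae"],  # sister clade spans several species of the genus
        ["Apidae"],          # sister clade spans several genera of the family
    ]
    for taxon, p in sorted(assignment_probabilities(sampled).items()):
        print(f"{taxon}: {p:.2f}")
```

Because a higher taxon contains every sister clade that its subtaxa contain, the estimated probability can only stay equal or increase when moving from species to genus to family.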


UMD CMSC 828G - Assigning Sequences to Taxa
