UMD CMSC 828G - Visual and statistical comparison of metagenomes - D2739721


School name University of Maryland, College Park

Course CMSC 828G - Advanced Topics in Information Processing: Data-Intensive Computing with MapReduce

Pages 7


BIOINFORMATICS ORIGINAL PAPER
Vol. 25 no. 15 2009, pages 1849–1855
doi:10.1093/bioinformatics/btp341

Genome analysis

Visual and statistical comparison of metagenomes

Suparna Mitra1,∗, Bernhard Klar2 and Daniel H. Huson1
1Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen and 2Institute for Stochastics, Karlsruhe University, Kaiserstraße 89, 76133 Karlsruhe, Germany

Received on January 26, 2009; revised and accepted on May 29, 2009
Advance Access publication June 10, 2009
Associate Editor: Dmitrij Frishman

ABSTRACT

Background: Metagenomics is the study of the genomic content of an environmental sample of microbes. Advances in the throughput and cost-efficiency of sequencing technology are fueling a rapid increase in the number and size of metagenomic datasets being generated. Bioinformatics is faced with the problem of how to handle and analyze these datasets in an efficient and useful way. One goal of these metagenomic studies is to get a basic understanding of the microbial world both surrounding us and within us. One major challenge is how to compare multiple datasets. Furthermore, there is a need for bioinformatics tools that can process many large datasets and are easy to use.

Results: This article describes two new and helpful techniques for comparing multiple metagenomic datasets. The first is a visualization technique for multiple datasets and the second is a new statistical method for highlighting the differences in a pairwise comparison. We have developed implementations of both methods that are suitable for very large datasets and provide these in Version 3 of our standalone metagenome analysis tool MEGAN.

Conclusion: These new methods are suitable for the visual comparison of many large metagenomes and the statistical comparison of two metagenomes at a time.
Nevertheless, more work needs to be done to support the comparative analysis of multiple metagenome datasets.

Availability: Version 3 of MEGAN, which implements all ideas presented in this article, can be obtained from our web site at: www-ab.informatik.uni-tuebingen.de/software/megan.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

∗To whom correspondence should be addressed.

1 INTRODUCTION

Metagenomics is the study of the genomic content of an environmental sample of microbes. Jo Handelsman coined the term in 1998, so the year 2008 marks the 10th birthday of metagenomics (Handelsman et al., 1998). As of January 2009, 51 metagenome projects have been completed, 86 are ongoing and there are many new metagenomics projects producing a huge amount of DNA sequences (Bernal et al., 2001). Advances in the throughput and cost-efficiency of sequencing technology are fueling a rapid increase in the number and size of metagenomic datasets being generated. Researchers are now able to study the DNA of a wider range of microorganisms and genes on a more complete and detailed scale. The basic questions of interest are: which species are present in a given environment, and what types of genes, functions or pathways are present in the DNA or actually active in the sample? As research begins to answer these basic questions, the focus will shift to the comparison of different datasets, because researchers will want to determine and understand the similarities and differences between the metagenomes of different environments.

There are a number of different systems and resources for metagenome or similar analysis, which are offered in the form of databases, web portals, web services and very basic standalone programs (Dutilh et al., 2008; Krause et al., 2008; Lozupone et al., 2006; Markowitz et al., 2006, 2008; McHardy et al., 2006; Meyer et al., 2008; Overbeek et al., 2005; Seshadri et al., 2007; Teeling et al., 2004; von Mering et al., 2007).
These resources are mainly focused on the analysis of individual metagenomes and currently do not have the capacity for rapid and highly interactive comparison of multiple datasets. In our experience, currently only the MG-RAST web server (Meyer et al., 2008; Overbeek et al., 2005) provides a readily useable service for analysis of a new metagenomic dataset. However, while web portals are attractive because they offer large computational resources for data analysis, some scientists have concerns about uploading their unpublished data to a web site.

At the beginning of 2007, we released and published the first publicly available, standalone analysis tool for metagenomic data, called MEGAN (Huson et al., 2007). We initially developed this tool to analyze the microbial community present in a sample of mammoth bone (Poinar et al., 2006). To use MEGAN in a typical metagenome project, DNA reads should be collected from the sample using a random shotgun protocol. Next, a sequence comparison of all reads against one or more reference databases is performed using BLAST (Altschul et al., 1990) or a similar comparison tool. MEGAN takes the result as input and produces a taxonomical analysis of the sample, obtained by assigning the reads to different nodes in the NCBI taxonomy using the ‘LCA-assignment’. A read often matches more than one database entry, in which case, the LCA-algorithm assigns reads to the lowest common ancestor of the hits. This is a combinatorial algorithm to estimate the taxonomical content of a metagenome based on sequence comparisons. For more details, see Huson et al. (2007). Additionally, the gene content is analyzed using COGs (Tatusov et al., 1997). As an exploration tool designed and optimized to run on a laptop, MEGAN allows interactive exploration of metagenomic datasets, both at a high level and also at a very detailed level.

© The Author 2009. Published by Oxford University Press. All rights reserved.
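
The idea behind the LCA-assignment can be illustrated with a minimal Python sketch. It is not MEGAN's implementation: the function names, the child-to-parent dictionary and the tiny toy taxonomy are assumptions made for illustration only, and the sketch ignores any filtering of hits by alignment score that a real tool may apply. A read is placed on the deepest taxonomy node that is an ancestor of (or equal to) every taxon it hits.

```python
from typing import Dict, Iterable, List

# Toy taxonomy (an assumption for illustration, not NCBI data):
# each taxon maps to its parent; the root has no entry.
TOY_PARENT: Dict[str, str] = {
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Bacteroidetes": "Bacteria",
    "Clostridia": "Firmicutes",
    "Bacilli": "Firmicutes",
    "Bacteroidia": "Bacteroidetes",
}


def path_from_root(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the list of taxa from the root down to `taxon`."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(hits: Iterable[str], parent: Dict[str, str]) -> str:
    """Return the deepest taxon shared by the root paths of all hits."""
    paths = [path_from_root(t, parent) for t in hits]
    if not paths:
        raise ValueError("a read needs at least one hit to be assigned")
    lca = None
    # Walk all root-first paths in parallel until they diverge.
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    if lca is None:
        raise ValueError("hits do not share a common root")
    return lca


if __name__ == "__main__":
    # A read hitting two classes of the same phylum is assigned to the phylum.
    print(lowest_common_ancestor(["Clostridia", "Bacilli"], TOY_PARENT))
    # A read hitting taxa from different phyla moves up to their shared ancestor.
    print(lowest_common_ancestor(["Clostridia", "Bacteroidia"], TOY_PARENT))
```

A read with a single hit stays on that taxon, while a read whose hits are spread widely across the taxonomy is pushed towards the root, so the assignment becomes less specific as the hits become more ambiguous.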
In this article, we introduce some recent extensions to MEGAN that allow the comparative analysis of multiple datasets. Our main aim is to provide a simple but powerful tool that quickly provides an impression of the similarity between multiple datasets and, in a pairwise comparison, highlights taxa for which the number of assigned reads differs in a statistically significant way. We describe these two new techniques and illustrate their use by comparing the content of an obese mouse dataset with a lean mouse dataset (Turnbaugh et al., 2006),
