UMD CMSC 828G - A DNA composition-based algorithm for binning environmental shotgun reads

CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads

Sourav Chatterji (1), Ichitaro Yamazaki (2), Zhaojun Bai (2), Jonathan A Eisen∗ (1,3,4)

(1) UC Davis Genome Center
(2) Department of Computer Science, UC Davis
(3) Section of Evolution and Ecology, UC Davis
(4) Department of Medical Microbiology and Immunology, UC Davis

∗Corresponding author

arXiv:0708.3098v1 [q-bio.GN] 22 Aug 2007

Abstract

A major hindrance to studies of microbial diversity has been that the vast majority of microbes cannot be cultured in the laboratory and thus are not amenable to traditional methods of characterization. Environmental shotgun sequencing (ESS) overcomes this hurdle by sequencing the DNA from the organisms present in a microbial community. The interpretation of this metagenomic data can be greatly facilitated by associating every sequence read with its source organism. We report the development of CompostBin, a DNA composition-based algorithm for analyzing metagenomic sequence reads and distributing them into taxon-specific bins. Unlike previous methods that seek to bin assembled contigs and often require training on known reference genomes, CompostBin has the ability to accurately bin raw sequence reads without need for assembly or training. It applies principal component analysis to project the data into an informative lower-dimensional space, and then uses the normalized cut clustering algorithm on this filtered data set to classify sequences into taxon-specific bins. We demonstrate the algorithm's accuracy on a variety of simulated data sets and on one metagenomic data set with known species assignments. CompostBin is a work in progress, with several refinements of the algorithm planned for the future.

Background

Microbes are ubiquitous organisms that play pivotal roles in the earth's biogeochemical cycles.
Their most visible effects on human well-being arise through their roles as mutualistic symbionts and hazardous pathogens. The study of microbes is crucial to our understanding of the earth's life processes and human health. Most of our knowledge about microbes has been obtained through the study of organisms cultured in artificial media in the laboratory. Although this approach has provided profound biological insights, it is inadequate for studying the structure and function of many microbial communities. One obstacle has been that the vast majority of microbes have not been cultured and may not be culturable [1]. Even though culture independent methods such as 16S rRNA surveys [2, 3] have been developed, they are unable to simultaneously answer two fundamental questions: Who is out there? and What are they doing? The application of genome sequencing methods is revolutionizing this field by enabling us for the first time to address those two questions for unculturable microbial communities [4–6]. These techniques, called environmental genomics or metagenomics, study unculturable communities by analyzing the pooled genomes of all the organisms present in the community. The genomic data obtained can be analyzed to make inferences about both who is out there and what they are doing (e.g., [7]).

In one specific metagenomic method, environmental shotgun sequencing (ESS), DNA pooled from a microbial community is sampled randomly using whole genome shotgun sequencing. Thus, ESS data is made up of sequence reads from multiple species. This adds an additional layer of complexity compared to single-species genome sequencing, as it requires analysis of the metagenomic data in order to associate each sequence read with its source organism. Therefore, a critical first step in many metagenomic analyses is the distribution of reads into taxon-specific bins.

The difficulty of accurately binning ESS reads from whole genome data remains a significant hurdle in metagenomics.
The taxonomic resolution achievable by the analysis depends on both the binning method and the complexity of the community. For instance, binning into species-specific bins can be achieved in low-complexity microbial communities (e.g., the dual-bacterial symbiosis of sharpshooters [7]). However, the problem becomes more difficult in high-complexity communities with hundreds of species, such as the Sargasso Sea [4] and the human distal gut [6]. Because of these difficulties, many metagenomic studies (e.g., [8]) have resorted to analyzing at the level of the metagenome, essentially treating a microbial community as a bag of genes. This is not a satisfactory solution. Identifying and characterizing individual genomes can provide deeper insight into the structure of the community [7].

A variety of approaches have been developed for binning: assembly, phylogenetic analysis [9], database search [10], alignment with reference genome [11] and DNA composition metrics [12, 13]. Most current binning methods suffer from two major limitations: they require closely related reference genomes for training/alignment and they perform poorly on short sequences. To overcome the second difficulty, almost all current binning methods are applied to assembled contigs. However, most of the current generation assemblers can be confounded by metagenomic data since they implicitly assume that the shotgun data is from a single individual or clone. Therefore, we believe that assembly is risky when binning and that it is necessary to analyze raw sequence reads to get an unbiased look at the data.

To overcome the above-mentioned disadvantages of other binning methods, we have developed CompostBin, a binning algorithm based on DNA composition.
CompostBin can bin raw sequence reads into taxon-specific bins with high accuracy and does not require training on currently available genomes. Like other composition-based methods, it seeks to distinguish different genomes based on their characteristic DNA compositional patterns, termed "signatures." For example, one of the most commonly used DNA metrics measures the frequency of occurrence of Kmers (oligonucleotides of length K) in a DNA sequence. Kmer frequencies have been used to distinguish between organisms since the 1960s [14]. With the explosion of available genomic data in the 1990s, Karlin and colleagues were able to establish that the relative abundances of various dinucleotide sequences (the dinucleotide odds ratio) is a genomic signature [15]. Subsequently, taxon-specific biases were
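
As an illustration of the two composition metrics named in this paragraph, here is a minimal Python sketch. It is not CompostBin's implementation. The function names `kmer_frequencies` and `dinucleotide_odds_ratio` are hypothetical. The odds ratio is computed on a single strand as f(XY) / (f(X) f(Y)), which is a simplifying assumption; published genomic-signature analyses often also pool each sequence with its reverse complement, and this sketch omits that step. Windows containing characters other than A, C, G, or T are ignored.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Map each of the 4**k possible Kmers to its relative frequency in seq."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Normalize only over windows made of A/C/G/T (skips N and other symbols).
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


def dinucleotide_odds_ratio(seq):
    """Single-strand odds ratio f(XY) / (f(X) * f(Y)) for all 16 dinucleotides.

    Assumption: no reverse-complement pooling is applied here.
    """
    mono = kmer_frequencies(seq, 1)
    di = kmer_frequencies(seq, 2)
    ratios = {}
    for xy, f_xy in di.items():
        expected = mono[xy[0]] * mono[xy[1]]
        # Undefined when one of the two bases never occurs in the sequence.
        ratios[xy] = f_xy / expected if expected > 0 else float("nan")
    return ratios


if __name__ == "__main__":
    read = "ACGTACGTTTGACCA"  # made-up example read
    tetra = kmer_frequencies(read, 4)  # 256-dimensional composition vector
    print(len(tetra), round(sum(tetra.values()), 6))
    print(dinucleotide_odds_ratio(read)["AC"])
```

A ratio near 1 means the dinucleotide occurs about as often as its base composition alone would predict; values well above or below 1 indicate over- or under-representation. For short reads, many of the 4^K Kmers have zero counts, so the resulting vectors are sparse and noisy.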

