UMD CMSC 828G - accurate determination of microbial diversity - D894248

School name: University of Maryland, College Park

Course: CMSC 828G - Advanced Topics in Information Processing: Data-Intensive Computing with MapReduce

Pages: 6

Unformatted text preview:

Accurate determination of microbial diversity from 454 pyrosequencing data

Christopher Quince¹, Anders Lanzén², Thomas P Curtis³, Russell J Davenport³, Neil Hall⁴, Ian M Head³, L Fiona Read³ & William T Sloan¹

¹Department of Civil Engineering, Rankine Building, University of Glasgow, Glasgow, UK. ²Department of Biology, University of Bergen, Bergen, Norway. ³School of Civil Engineering and Geosciences, University of Newcastle upon Tyne, Newcastle upon Tyne, UK. ⁴School of Biological Sciences, University of Liverpool, Liverpool, UK. Correspondence should be addressed to C.Q. ([email protected]).

Received 19 February; accepted 10 July; published online 9 August 2009; doi:10.1038/nmeth.1361

Brief Communications. Nature Methods | Vol. 6 No. 9 | September 2009 | 639. © 2009 Nature America, Inc. All rights reserved.

We present an algorithm, PyroNoise, that clusters the flowgrams of 454 pyrosequencing reads using a distance measure that models sequencing noise. This infers the true sequences in a collection of amplicons. We pyrosequenced a known mixture of microbial 16S rDNA sequences extracted from a lake and found that without noise reduction the number of operational taxonomic units is overestimated, but using PyroNoise it can be accurately calculated.

Pyrosequencing as implemented by Roche's 454 is a technology that generates a large number of intermediate-length DNA reads through a massively parallel sequencing-by-synthesis approach [1]. The GS FLX implementation generates ~400,000 reads of ~250 base pairs in a single run. In many environmental genomics applications of pyrosequencing, DNA is extracted from an entire microbial community, and a particular target region flanked by conserved primers is amplified by PCR before sequencing. This generates an amplicon dataset, in which every read stems from a homologous region, and the sequence variation between the reads reflects the phylogenetic diversity in the community. Because there is no cloning step, resequencing to increase accuracy is not possible, and it is therefore vital to disentangle noise from true sequence diversity in this type of data.

During pyrosequencing, each base, in turn, is washed across a plate with hundreds of thousands of wells in which beads attached to multiple copies of a single DNA molecule are localized. If the first unpaired base in a well is complementary to the incoming base, then synthesis occurs and, through a series of chemical reactions, light is emitted. Subsequent synthesis and increased light emission will occur if a homopolymer is present. The pattern of light intensities, or flowgram, emitted by each well can then be used to determine the DNA sequence. The major source of noise is that the light intensities do not faithfully reflect the homopolymer lengths. Instead, a distribution of light intensities is associated with each length, and the variance of this distribution increases with length [1]. The standard base-calling procedure is to round the continuous intensities to integers. Consequently, long homopolymers result in frequent miscalls: either insertions or deletions [2].

We calculated the probability distributions of observing a given signal intensity for each homopolymer length by pyrosequencing the V5 region of 23 clones of known sequence (Online Methods). The sequences of these clones differed by at least 7%, and we could therefore unambiguously associate each flowgram with the sequence that generated it. We aligned flowgrams to their parent sequences using an exact Needleman-Wunsch algorithm and then used all signal intensities from each homopolymer length to generate histograms (Supplementary Fig. 1). For homopolymer lengths greater than 5, for which insufficient data were available to construct histograms (Supplementary Table 1), we used normal distributions with extrapolated parameters (Online Methods).

Figure 1 | OTU number as a function of percentage sequence difference for 90 pyrosequenced 16S rRNA gene clones of known sequence. (a,b) Results are repeated for complete linkage (a) and average linkage algorithms (b). Axes: sequence difference (%) against number of OTUs. Legend: true number of OTUs; OTU assignment using just the pyrosequencing data; RDP pyrosequencing pipeline; flowgram preclustering before OTU generation; former with chimera removal.

The starting point for our algorithm was the realization that we should work with the light intensities associated with each read, or flowgrams, rather than their translations into sequences. Intuitively, two sequences can differ substantially, whereas their flowgrams can be similar. To use an example from a real dataset: the true sequence is ACTGGGG, which without noise and with nucleotides flowed in the order TACG, would give the flowgram 0, 1, 1, 0|1, 0, 0, 4, where the | indicates a new series of the four nucleotides. Instead we observed 0.18, 1.03, 1.02, 0.70|1.12, 0.07, 0.14, 4.65, a flowgram that is not that dissimilar but that translates into a sequence, ACGTGGGGG, with two insertions. Using the flowgrams and the distributions of observed intensities, we defined a distance reflecting the probability that a flowgram was generated by a given sequence [3] (Online Methods). These distances were then used in a mixture model to define a likelihood of observing all the flowgrams assuming that they were generated from a set of true underlying sequences [4]. We used an iterative expectation-maximization algorithm to maximize this
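
The flowgram example in the text (true sequence ACTGGGG, flow order TACG) can be reproduced with a short sketch. The code below is only an illustration of the idea, not the authors' implementation: the function names `seq_to_flowgram` and `call_bases` are hypothetical, and the base caller simply rounds each intensity to the nearest integer, which is how the text describes the standard procedure.

```python
FLOW_ORDER = "TACG"  # nucleotide flow order used in the text's example


def seq_to_flowgram(seq, flow_order=FLOW_ORDER):
    """Return the ideal (noise-free) flowgram: one homopolymer length per flow.

    Hypothetical helper for illustration; it assumes every base in `seq`
    appears in `flow_order`.
    """
    if set(seq) - set(flow_order):
        raise ValueError("sequence contains bases outside the flow order")
    flows, pos, i = [], 0, 0
    # Keep flowing until the sequence is consumed and the last cycle is complete.
    while pos < len(seq) or i % len(flow_order) != 0:
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append(run)
        i += 1
    return flows


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Translate a flowgram into a sequence by rounding each intensity."""
    calls = []
    for i, signal in enumerate(flowgram):
        length = int(signal + 0.5)  # round half up; intensities are non-negative
        calls.append(flow_order[i % len(flow_order)] * length)
    return "".join(calls)


if __name__ == "__main__":
    ideal = seq_to_flowgram("ACTGGGG")
    observed = [0.18, 1.03, 1.02, 0.70, 1.12, 0.07, 0.14, 4.65]
    print(ideal)                 # [0, 1, 1, 0, 1, 0, 0, 4]
    print(call_bases(observed))  # ACGTGGGGG
```

Rounding turns the 0.70 signal in the first G flow into a spurious G and the 4.65 signal into a fifth G, which gives the two insertions mentioned in the text even though the observed and ideal flowgrams are numerically close.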
