CMU BSC 03711 - Sense from sequence reads: methods for alignment and assembly - D2165653

School name Carnegie Mellon University

Course Bsc 03711- Computational Molecular Biology and Genomics

Sense from sequence reads: methods for alignment and assembly

Alignment
Hash-based alignment methods.
Burrows-Wheeler transform methods.
Assembly
Discussion
Figure 1 | Schematic of a hash table-based alignment strategy.
Figure 2 | The Burrows-Wheeler transform for genomic sequence data.
Figure 3 | Constructing and visualizing a de Bruijn graph of a DNA sequence.
Acknowledgments
Competing interests statement
References

The advent of ultra-high-throughput sequencing technology has captured the imagination of the biological sciences, and with good reason. Ten years ago, on 23 November 1999, the publicly funded human genome project held a massive, worldwide celebration to mark the completion of 1 billion base pairs (bp), one-third of the way to the full sequence of the human genome (http://www.genome.gov/10002105/). The amount of sequence was so incredible at the time that the celebration featured senators and US cabinet officials. Commemorative T-shirts marking the occasion were distributed. Today, sequencing 1 billion bp is the work of hours in any lab equipped with an Illumina GAII or ABI SOLiD ‘second generation’ sequencing machine and the work of minutes in large-scale sequencing centers. These large centers may have 40 or more such machines sitting in the middle of a massive production pipeline that requires a substantial wet-lab work flow to feed the sequencing machines and almost unimaginable computational support to make any sense of the data coming from the sequencers. All indications suggest we have only scratched the surface of the potential for ubiquitous DNA sequencing technology to change the way experiments are conducted and biology is understood.

One thing that has not changed in the last 10 years is that the individual outputs of the sequence machines are essentially worthless by themselves.
The individual ‘reads’ (named so from the days when the sequence of a given DNA molecule was determined by a pair of human eyes looking down an autoradiograph of a gel that had a separate lane for each base) range from approximately 800 bp, using the older technology used by the human genome project, to approximately 30 bp for the introductory versions of the second-generation sequencing machines so popular today. Current output ranges from 50 to 400 bp, depending on the technology and the specific biological application. Although uninformative by themselves, once analyzed collectively DNA sequencing reads have tremendous versatility, and the existing applications of next-generation sequencing are extensive. Fundamental to creating biological understanding from the increasing piles of sequence data is the development of analysis algorithms able to assess the success of the experiments and synthesize the data into manageable and understandable pieces.

We will focus on two of the most fundamental computational analyses in the context of sequence analysis: alignment and assembly. When a reference genome assembly exists (for example, for human or mouse), alignment remains the first and most fundamental analysis once the DNA sequence has been produced. The results of the alignment have the ability to quickly determine whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the biological experiment and DNA preparation succeeded. For organisms without a sequenced reference genome, assembly is almost always essential for analysis.
However, in order to develop algorithms to accurately assemble new genomes, the existence of already assembled reference genomes in other species is

Sense from sequence reads: methods for alignment and assembly
Paul Flicek & Ewan Birney

The most important first step in understanding next-generation sequencing data is the initial alignment or assembly that determines whether an experiment has succeeded and provides a first glimpse into the results. In parallel with the growth of new sequencing technologies, several algorithms that align or assemble the large data output of today’s sequencing machines have been developed. We discuss the current algorithmic approaches and future directions of these fundamental tools and provide specific examples for some commonly used tools.

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. Correspondence should be addressed to P.F. ([email protected]).
Published online 15 October 2009; corrected online 6 May 2010; doi:10.1038/nmeth.1376

S6 | VOL.6 NO.11s | NOVEMBER 2009 | Nature Methods Supplement
Review
© 2009 Nature America, Inc. All rights reserved.

all represented by the same color). Hence, some sequence errors are correctable, assuming that the analysis tools explicitly consider this aspect of the data⁵. The Illumina technology uses other techniques to remove likely errors earlier in the processing pipeline (that is, before the alignment or assembly process), including ‘purity filtering’ reads appearing to come from more than one DNA molecule⁶.

Alignment

Alignment itself is the process of determining the most likely source within the genome sequence for the observed DNA sequencing read, given the knowledge of which species the sequence has come from. Sequencing reads may also be aligned to other genomes, assuming the evolutionary distance between the genomes is appropriate.
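The definition of alignment given above can be made concrete with a deliberately naive sketch. It is not the method of any tool discussed in this review: it simply slides the read along a reference string, counts mismatches at each offset, and reports the offsets with the fewest. The function names, the toy sequences, and the `max_mismatches` threshold are all assumptions introduced for illustration, and the sketch ignores insertions, deletions, the reverse strand, and base quality scores.

```python
from typing import List, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_hits(
    read: str, reference: str, max_mismatches: int = 2
) -> Tuple[Optional[int], List[int]]:
    """Return (fewest mismatches, 0-based offsets that achieve it).

    Returns (None, []) when no offset is within max_mismatches.
    Illustrative brute force only: it scans every offset of the reference.
    """
    best: Optional[int] = None
    offsets: List[int] = []
    for start in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[start:start + len(read)])
        if d > max_mismatches:
            continue  # too many mismatches to be a plausible source
        if best is None or d < best:
            best, offsets = d, [start]
        elif d == best:
            offsets.append(start)  # equally good placement (ambiguous read)
    return best, offsets


# Toy data, invented for this example.
reference = "ACGTTAGCCGATAGGCTTACGATCGA"
read = "GATAGGCTAACG"  # reference[9:21] with one substituted base

print(best_hits(read, reference))  # (1, [9])
```

A scan like this takes time proportional to the read length multiplied by the genome length for every read, which is impractical for billions of base pairs of reads against a genome-sized reference.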
The most widely used alignment programs for second-generation sequence data have been explicitly designed (or modified) for the purpose of aligning these data. Unlike earlier-generation sequence alignment programs such as BLAST, which were designed in an environment that required alignments of protein sequences and searching through large databases to find homologous sequences, today’s short-read alignment programs are generally used for the alignment of DNA sequence from the species of interest to the reference genome assembly of that species. This difference, although it initially seems subtle, has several consequences for the final algorithm design and implementation, which include letting assumptions about the number of expected mismatches be driven by the species polymorphism rate and the technology error rate rather than by considerations of evolutionary substitutions.

In general, these assumptions allow for much faster processing, as few low-quality alignments are either expected or scored. Given the massive data volumes produced by the present sequencing machines, this has also allowed alignments to be calculated without a correspondingly massive
