DOC PREVIEW
Toronto CSC 2427 - The Fragment Assembly String Graph

This preview shows page 1-2 out of 7 pages.

Save
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Vol 21 Suppl 2 2005 pages ii79 ii85 doi 10 1093 bioinformatics bti1114 BIOINFORMATICS Genes and Genomes The fragment assembly string graph Eugene W Myers Department of Computer Science University of California Berkeley CA USA ABSTRACT We present a concept and formalism the string graph which represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads collected from it We give time and space efficient algorithms for constructing a string graph given the collection of overlaps between the reads and in particular present a novel linear expected time algorithm for transitive reduction in this context The result demonstrates that the decomposition of reads into k mers employed in the de Bruijn graph approach described earlier is not essential and exposes its close connection to the unitig approach we developed at Celera This paper is a preliminary piece giving the basic algorithm and results that demonstrate the efficiency and scalability of the method These ideas are being used to build a next generation whole genome assembler called BOA Berkeley Open Assembler that will easily scale to mammalian genomes Contact gene eecs berkeley edu 1 INTRODUCTION Paired end whole genome shotgun sequencing has become the prevailing paradigm for the first phase of sequencing an organism s genome Weber and Myers 1997 Adams et al 2000 Venter et al 2001 and routinely delivers 95 99 of the euchromatic sequence in large scaffolds of ordered and oriented contigs The experiments required to finish the remaining few percent are an order of magnitude more expensive than the shotgun sequencing For this reason only the most important reference genomes will likely ever to be finished Thus improved algorithms and software for whole genome shotgun sequencing can have a large impact on genomic science For example an assembler that takes a 97 reconstruction and improves it to 99 is reducing the amount of unresolved sequence by a factor of three from 3 to 1 and typically improving contig sizes by a factor of 10 or more by resolving 90 of the gaps according to the Poisson statistical theory of sampling Lander and Waterman 1988 There are currently a number of assemblers capable of wholegenome assembly that all perform comparably by employing variations on the basic paradigm of first finding and assembling unique stretches of DNA with high reliability Myers et al 2000 Aparicio et al 2002 Mullikin and Ning 2003 Jaffe et al 2003 Huang et al 2003 Recurrent strategies include finding mutually reinforcing read pairs and examining the relationship of a read to all others in order to assess its repetitiveness and to correct errors The central objective for better assembly is to effectively resolve repetitive sequences Consider perfect data and a genome that has several perfect repeats whose lengths are longer than any read as shown in Figure 1 Imagine that the genome is a piece of thread and meld or collapse all the like repetitive elements as illustrated in Figure 1 We call the resulting graph a string graph and it effectively represents everything that can Genome String Graph 2 2 3 Fig 1 A genome and its string graph The thick arrows of the same shade represent identical repetitive sequences The numbers in the string graph give the number of copies of each repeat inferable by counting entry and exits into the collapsed segment be inferred from the read data If one can identify arcs corresponding to unique sequence then a simple flow analysis reveals how many copies of each repeat are collapsed together and if one replicates each of those edges according to their copy count then the expanded graph has an Eulerian tour and one of those tours corresponds to the original sequence We show how to build such a graph in this paper In 1992 this author and Waterman Idury presented two new ideas for fragment assembly in back to back talks at a DIMACs workshop on bioinformatics which were later published in the same volume of a journal Myers 1995 Idury and Waterman 1995 Myers approach was based on finding maximal interval subgraphs in the graph of all read overlaps and was subsequently developed into the unitig concept of the Celera assembler Waterman and Idury presented an approach based on building the de Bruijn graph of all kmers from the reads and then finding paths in this graph supported by the reads This approach was subsequently extended by Pevzner et al 2001 of his research group giving rise to the Euler assembler While the idea of a string graph is explicit in the Euler algorithms we show in this paper that a string graph directly follows from the unitig algorithm as well What this amounts to is a demonstration that the idea of kmers is unnecessary that one can work directly from the reads obviating the need for the complex read based path splitting of Euler and giving rise to a much more space efficient algorithm one that can scale to a mammalian genome on current hardware Our approach requires a transitive reduction step and we give a novel linear expected time algorithm for this in our context We also show how to treat contained reads in an efficient way and how to efficiently solve large parts of the rigorously formulated minimum cost network flow problem that is our last step with a series of simplifying reductions Finally the problem of orientation i e that reads can be from either the forward or reverse strand is The Author 2005 Published by Oxford University Press All rights reserved For Permissions please email journals permissions oxfordjournals org ii79 E W Myers addressed and accounted for in our string graph formulation This work is a first report on a new line of algorithm development so we conclude with some preliminary empirical results and a brief discussion of future work 2 o f end f B comp f g g E g f f g f g o g beg g f f E o g end BUILDING A STRING GRAPH Consider a genome sequence S of length G and a shotgun data set of n reads f1 f2 fn randomly sampled from either the forward or reverse strand of the genome Let f len be the length of read f and let f 1 f 2 f 3 f f len be its DNA sequence over the alphabet A C G T The average over sampling of the genome is c N G where N f f len is the total amount of sequence in the dataset We assume that the reads have been preprocessed so that each is truly a sample of the genome i e contains no vector sequence or other contaminant and that the mean error rate of the read s sequence is less than e g 2 5 under an exponentially


View Full Document

Toronto CSC 2427 - The Fragment Assembly String Graph

Download The Fragment Assembly String Graph
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view The Fragment Assembly String Graph and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view The Fragment Assembly String Graph and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?