Stanford CS 262 - Study Notes

Commentary

Whole-genome disassembly

Phil Green*
Howard Hughes Medical Institute and University of Washington, Seattle, WA 98195

The race to sequence the human genome has garnered a level of popular attention unprecedented for a scientific endeavor. This fascination has partly been caused, of course, by the importance of the goal; but it also reflects the Olympian nature of the contest, which opposed two capable teams with sharply contrasting cultures (public and private), personalities, and strategies. Titanic struggles being the stuff of mythology, it should perhaps not surprise us that a number of myths regarding this race have already emerged. In a recent issue of PNAS, Waterston et al. (1), leaders of the public effort, help to dispel one of these myths, involving the controversial “whole-genome shotgun” strategy used by Celera.

Issues surrounding sequencing strategies will no doubt seem arcane to most readers but are worth considering if only because they may significantly influence the pace and cost of DNA sequencing during the remainder of the Genome Era. That a strategy is needed at all arises from the fact that a sequencing “read,” the tract of data obtainable in a single experimental run, is only a few hundred bases in length and contains errors. Getting reliable sequence of a larger DNA segment therefore requires a method for generating and piecing together a number of reads covering the segment.
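To make the idea of "piecing together" reads concrete, the following is a toy sketch and not the method used by either sequencing effort. It assumes error-free reads and uses a simple greedy merge on the longest suffix-prefix overlap; the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Toy assumption: reads are error-free. Real reads contain errors,
    and repeated sequences produce false overlaps that a greedy
    merge like this one cannot detect.
    """
    # Drop exact duplicates (keeping order) and reads contained in others.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the result stays in several pieces
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads drawn from the made-up sequence ATGGCGTACGTTAGC.
print(greedy_assemble(["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]))
# prints ['ATGGCGTACGTTAGC']
```

When the reads do not all overlap, the function returns several separate pieces, which corresponds loosely to the gaps that a later directed step would have to close.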
Since its introduction by Sanger and colleagues over 20 years ago, the favored method for this purpose has comprised the following steps: an initial “shotgun” phase in which reads are derived from subclones essentially randomly located within the targeted region; an assembly phase, in which read overlaps are determined (the main challenge here being to identify and discard false overlaps arising from repeated sequences) and used to approximately reconstruct the underlying sequence; and a finishing phase in which additional reads are obtained in directed fashion to close gaps and shore up data quality where needed. The shotgun phase usually involves obtaining a substantial redundancy of read coverage of the target, typically at least 6–8-fold, to minimize the amount of work required during the labor-intensive finishing phase.

For the human genome, which comprises some 3 billion base pairs, the public effort adopted a well-tested modular approach in which large fragments of the genome (roughly 150,000 bp in size) were first cloned into a bacterial host (as bacterial artificial chromosomes or BACs) and then sequenced individually by the shotgun method. Among other advantages, this “clone by clone” strategy simplifies the assembly problem (by reducing its scale and the likelihood of errors caused by repeats), generates substantial sequence tracts of known contiguity that can be mapped relatively efficiently back to the genome, and yields resources that are useful in the finishing stage and for independent tests of assembly accuracy. A “draft” version of the genome sequence (based on a somewhat lower shotgun depth coverage for most of the clones) obtained in this way was published last year (2).

In contrast, Celera adopted a whole-genome shotgun approach, which purports to accelerate the above process by bypassing the intermediate step of cloning large fragments and instead derives reads directly from the whole genome.
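The coverage figures quoted above (6–8-fold redundancy, a 150,000 bp BAC, a 3 billion bp genome) can be turned into rough read counts. The sketch below is illustrative arithmetic only. It assumes a read length of 500 bases, a value chosen here because the text says only "a few hundred bases". It also uses the standard Poisson approximation for random shotgun sampling, under which the chance that a given base is covered by no read at coverage c is about e^(-c); that model is not taken from the commentary and ignores cloning bias and repeats.

```python
import math


def reads_needed(target_bp: int, read_len: int, coverage: float) -> int:
    """Number of reads giving the requested average fold coverage."""
    return math.ceil(coverage * target_bp / read_len)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation: fraction of bases hit by no read."""
    return math.exp(-coverage)


READ_LEN = 500  # assumed; the text says only "a few hundred bases"

for label, size in [("BAC (150,000 bp)", 150_000),
                    ("genome (3 billion bp)", 3_000_000_000)]:
    for c in (6, 8):
        n = reads_needed(size, READ_LEN, c)
        gap = expected_uncovered_fraction(c)
        print(f"{label}, {c}x: {n:,} reads, ~{gap:.4%} of bases uncovered")
```

Under these assumptions a single BAC needs about 1,800 to 2,400 reads, while the whole genome needs 36 to 48 million. Even at 6-fold coverage the idealized model leaves roughly 0.25% of bases unsampled, which illustrates why the directed finishing phase is still needed to close gaps.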
The process is clearly riskier because of the significantly greater possibility of assembly error, but had been successfully used by Celera to produce a near-complete sequence of the Drosophila genome (3, 4) with about 2,500 gaps. Its ability to cope with the human genome, which is 30-fold larger and much richer in repetitive sequences than Drosophila, remained unclear. Against all odds, Celera demonstrated that it worked (5), producing an independent human genome sequence of comparable or higher quality than that obtained by the public effort.

Or did they?

This is the myth that Waterston et al. (1) overturn. Far from constructing an independent sequence, Celera incorporated the public data in three important ways into their “whole genome assembly.” (i) The assembled BAC sequences from the public project were “shredded” in a manner that (as Waterston et al. show) retained nearly all of the information from the original sequence, and used as input. (ii) In a process called “external gap walking,” unshredded, assembled, public BAC sequences were used to close gaps. (iii) Public mapping data were used to anchor sequence islands to the genome. As a result, the assembly reported by Celera cannot be viewed as a true whole-genome shotgun assembly. Moreover, accuracy tests in ref. 5, which involved comparison of Celera’s assembly to finished portions of the public sequence, are virtually meaningless because the finished sequence was itself used in constructing the Celera assembly.

We are left with no idea how a true whole-genome assembly would have performed. It is striking, however, that even with this use of the public data, what Celera calls a whole-genome assembly was a failure by any reasonable standard: 20% of the genome is either missing altogether or is in the form of 116,000 small islands of sequence (averaging 2.3 kb in size) that are unplaced, and for practical purposes unplaceable, on the genome.

Several other myths beyond the one discussed by Waterston et al. have become widely accepted. One is that the whole genome shotgun approach was in large measure responsible for Celera’s rapid pace at sequencing the Drosophila and human genomes. In fact, their great speed was mainly because of the acquisition of a huge, unprecedented sequencing capacity (some 200+ capillary machines, each able to produce 500–1000 reads per day) as a result of their corporate ties with a manufacturer of these machines. That this was really the key factor is evident from the fact that when the public effort acquired similar capacity, they were able to attain a comparable or higher throughput by using the clone by clone approach.

A third myth is that the whole-genome approach saves money. Although definitive judgement here should await a rigorous cost accounting, the basic economics of sequencing by the clone by clone approach have apparently not changed greatly over the past 5 or 6 years. Less than 10% of the overall cost goes to BAC mapping and subclone library construction, 50–60% to the shotgun itself (assuming a coverage of 6–8×), and the remaining 30–40% goes to finishing. Even if

