Stanford CS 262 - Study Notes

Commentary

On the sequencing and assembly of the human genome

Eugene W. Myers*, Granger G. Sutton, Hamilton O. Smith, Mark D. Adams, and J. Craig Venter

Celera Genomics, 45 W. Gude Drive, Rockville, MD 20850

On June 26, 2000, Celera Genomics and the International Human Genome Sequencing Consortium (HGSC) announced at the White House the completion of the first assembly of the human genome and the completion of a rough draft, respectively. In February of 2001, the two teams simultaneously published their analyses of the genome sequences generated (1, 2). The joint announcement and subsequent publications were a result of long discussions among Celera and HGSC scientists on reducing the negative rhetoric and demonstrating to the public that both teams were working for the public good. Now three laboratory leaders from the public consortium, Waterston, Lander, and Sulston (WLS), argue that Celera did not produce an independent sequence of the human genome or meaningfully demonstrate the whole-genome shotgun (WGS) technique (3). This conclusion is based on incorrect assumptions and flawed reasoning.

Our Starting Point Was a Shredding of Several Hundred Thousand Bactigs, Not of the HGSC Genome Assembly. The key assertion of WLS is that by using information from the HGSC, Celera’s method implicitly retained the full assembly structure produced by the HGSC. This is incorrect. As described in table 2 of ref. 1, we combined our data with a uniformly spaced 2× shredding of 677,708 individual bactigs, contigs of bacterial artificial chromosomes (BAC) clones shotgun sequenced by the HGSC, not the genome assembly reported in ref. 2. The goal of including this sequence was to take advantage (with attribution) of the work of the HGSC to the extent that it would contribute additional sequence coverage.
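To make the phrase "uniformly spaced 2× shredding" concrete, here is a minimal Python sketch of one way a bactig could be cut into overlapping pseudo-reads. It is an illustration only, not Celera's procedure: the function name `shred`, the default read length of 550 bp, and the handling of the final read are all assumptions introduced for this example.

```python
def shred(sequence, read_len=550, coverage=2.0):
    """Cut `sequence` into evenly spaced, overlapping pseudo-reads.

    Hypothetical helper for illustration. `read_len` and the end-of-sequence
    handling are assumptions, not values taken from the commentary.
    Returns a list of (start_offset, read) tuples.
    """
    if read_len <= 0 or coverage <= 0:
        raise ValueError("read_len and coverage must be positive")
    if not sequence:
        return []
    # A sequence shorter than one read is returned whole.
    if len(sequence) <= read_len:
        return [(0, sequence)]
    # Spacing the starts read_len / coverage apart means each interior base
    # falls in about `coverage` reads.
    step = max(1, round(read_len / coverage))
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    # Add a final read flush with the end so the tail is not dropped.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, sequence[s:s + read_len]) for s in starts]


if __name__ == "__main__":
    toy_bactig = "ACGT" * 500  # 2,000 bp stand-in for a bactig
    reads = shred(toy_bactig, read_len=500, coverage=2.0)
    print(len(reads), "reads; first offsets:", [s for s, _ in reads[:3]])
```

Because the offsets are regular, a shred of one error-free sequence reassembles trivially. The commentary's argument concerns what happens when hundreds of thousands of imperfect bactigs are shredded and pooled.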
The global order and the overall sequence of the genome were determined by using the set of 27 million mate-paired reads generated at Celera. Mate-pairs are sets of reads that are adjacent to one another in the genome and serve to link together nearby segments to promote assembly. The 38.7-fold genome coverage spanned by these mate-pairs provided the long-range order (over millions of basepairs) of both assembly methods reported in ref. 1. Without the Celera data, the best assembly that we could have produced would have been the 677,708 completely unordered bactigs, assuming that every shredded bactig would reconstitute itself during assembly as is claimed by WLS.

Simulation Using Chromosome 22 Alone Leads to a Distorted View of Assembly. WLS use a simulation to argue that a uniformly spaced 2× shredding would naturally result in such a reassembly of the HGSC bactig data. However, this exercise was not applied to the genome. Rather, it was applied to a single finished high-quality chromosome, constituting only 1% of the genome. It is thus misleading for the following reasons. First, the assembly problem is 100 times more complex for the genome than for a single chromosome, as the complete genome contains approximately 100 times more copies of each repetitive element than chromosome 22. Second, the majority of the HGSC data was in 6–8 kbp bactigs that were sometimes overlapping and occasionally misassembled, and whose sequence accuracy was as poor as 4% error near the tips. So assembling a shredding of such sequence must permit differences in read overlaps, whereas assembling a shredding of a finished sequence need not. Celera’s assembler considers all overlaps at 94% or greater similarity as equivalent (4) and uses the pairing of end-sequence reads as the principal factors for achieving accurate order. Traditional assemblers that make local decisions based on the degree of overlap similarity are intrinsically too error prone to be reliable at the scale of mammalian WGS.
Third, unlike the contiguous sequence of chromosome 22 used in the simulation, the HGSC data available in September of 2000 consisted of 5% predraft sequence consisting of 1×–3× light-shotgun reads of BACs, 75% rough-draft unordered bactigs of BACs derived from 3×–5× shotgun data of each BAC, and only 20% finished sequences of individual BACs (table 2 of ref. 1).†

Assembly Simulation with a Real-World Scenario Shows No Implicit Reassembly. We repeated the simulation experiment of WLS, but under a progression of conditions to demonstrate the impact of these real-world factors. With 100% identity (Table 1, first row) required for overlap, chromosome 22 is reconstituted from shredded reads by Celera’s whole-genome assembler to the same degree as in the simulation reported by WLS. But when imperfect overlaps are permitted (94% identity, second row), as is required to truly accommodate sequencing errors in the HGSC data, the impact of near-identity repeats just within chromosome 22 becomes apparent: a much larger number of contigs are generated. When assembled in the context of the remaining 99% of the genome (third row), the reassembled sequence for chromosome 22 is even more fractured. Finally, if one looks at contig sizes over a shredding of all of the HGSC data, 80% of which is rough draft (fourth row), the picture is even worse. When one (i) permits error in the overlaps, (ii) expands the problem to 100% of the genome, (iii) considers that most of

See companion article on page 3712 in issue 6 of volume 99.

*To whom reprint requests should be addressed. E-mail: [email protected].

†The HGSC data are described by WLS as a 7.5× data set, but it is not a 7.5× random shotgun data set. Different regions of the genome were represented by BACs that had been sequenced to different fold coverage. Having finished 12× sequence in one part of the genome does not improve the result in regions where there is only 2× or no data at all.

www.pnas.org/cgi/doi/10.1073/pnas.092136699    PNAS | April 2, 2002 | vol. 99 | no. 7 | 4145–4146

Table 1. Shredded data does not inherently reassemble

Data set                                  Overlap        No. of      Mean size,   N50 size,*
                                          criterion, %   contigs     kbp          kbp
2× shred of chromosome 22                 100            781         43.2         2,488.5
2× shred of chromosome 22                 94             2,433       13.8         256.0
Reconstruction of chromosome 22 in a
  2× shred of all HGSC data               94             10,142      3.6          20.4
2× shred of all HGSC data                 94             2,081,677   1.7          6.8

In isolation, a perfect 2× shred of chromosome 22 reassembles well. In the context of the entire genome and when a provision is made for imperfect overlaps, the degree of reassembly is much lower.
*Refers to the minimum length L such that 50% of all nucleotides are contained in contigs of length ≥L.
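The "Mean size" and "N50 size" columns of Table 1 can be reproduced from a list of contig lengths. The sketch below is a hedged illustration under the usual reading of N50: sort contigs from longest to shortest, accumulate their lengths, and report the length of the contig at which the running total first reaches half of all bases. The function names `n50` and `mean_size` are assumptions for this example, and the toy lengths are made up rather than taken from the assemblies discussed.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Interpretation assumed here: the length of the shortest contig in the
    smallest set of longest contigs that together contain at least 50% of
    all nucleotides.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point issues at exactly 50%.
        if 2 * running >= total:
            return length


def mean_size(contig_lengths):
    """Arithmetic mean contig length."""
    lengths = list(contig_lengths)
    if not lengths:
        raise ValueError("need at least one contig")
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    toy = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up lengths, in kbp
    print("mean:", mean_size(toy), "N50:", n50(toy))
```

The two statistics diverge when an assembly fragments: many short contigs pull the mean down quickly, while N50 tracks where the bulk of the sequence sits. This is why Table 1 reports both.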

