NATURE METHODS | VOL.5 NO.1 | JANUARY 2008 | 19
COMMENTARY | SPECIAL FEATURE

Sequence census methods for functional genomics

Barbara Wold & Richard M Myers

Next-generation sequencing technologies are beginning to facilitate genome sequencing. But in addition, new applications and new assay concepts have emerged that are vastly increasing our ability to understand genome function.

When Thomas Edison invented the phonograph, playing music was well down his list of possible uses. Something similar may now be happening in the genome world. A new generation of massively parallel DNA sequencing platforms is here (see Primer). They aim to replace the workhorse capillary sequencing systems that brought us, very beautifully, the human genome sequence. These machines, led by 454 and Illumina (formerly Solexa), and lately joined by ABI (ref. 1), have emerged in DNA sequencing centers over the past two years, promising vastly more sequence (>1 gigabase of sequence per run) than standard capillary-based technology can produce. Still other new machines are on the way. Their development is driven by the US National Institutes of Health and National Human Genome Research Institute challenges for DNA sequencing at costs of $100,000 and then of less than $1,000 per human or human-size genome. ‘If you build them, we will buy them’ was implied, and other near-infinite sequencing appetites such as those of microbial metagenomics researchers have added fuel. As hoped, the new instruments are being explored in the world’s genome centers for rapid and cheap genome sequencing. Read length limits, error rates and assembly algorithm issues, among other problems, mean that these new kids on the sequencing block have not fully reduced whole-genome sequencing to practice.
Not yet, at least.

But something different and remarkable happened on the way to inexpensive whole-genome sequencing: as music was to Edison’s phonograph, a new family of ‘sequence census’ counting assays is to this new generation of DNA sequencers. If you need to take the measure of an RNA or DNA ‘ome’, microarrays are no longer the only way to do it. A new and rapidly growing family of assays for measuring the global, genome-wide profiles of mRNAs, small RNAs, transcription-factor binding, chromatin structure, DNase hypersensitivity and DNA methylation status is now being implemented by applying one of the massively parallel, ultrahigh-throughput DNA sequencing systems.

The principle behind these global ‘sequence census’ methods is disarmingly simple: to learn the content of a complex nucleic acid sample, just sequence it. Sequence it directly, without bacterial cloning as a prerequisite, and do it with the aim of getting just enough sequence to assign the site of origin in the genome for each read rather than trying to determine its entire sequence. A single short sequence read (or sometimes a pair of reads, one from each end) is determined for millions of nucleic acid molecules from a biological sample. You need not sequence the entirety of each molecule in the starting mix, because a small snippet of 25–35 base pairs allows you to use informatics to identify the location of each fragment in the reference genome. Once mapped, you ‘count’ the hits and analyze their distribution throughout the genome. The key for these uses is the very high number of individual reads, each corresponding to a different molecule in the starting sample.
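The map-then-count idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the pipeline used in any of the studies discussed here: the two-sequence reference, the 6-base reads (standing in for the 25–35 bases needed on a real genome), the bin width and the function names `build_index` and `census` are all hypothetical, and an exact-match dictionary lookup stands in for a real alignment program that would tolerate sequencing errors. Reads that match more than one reference location, or none, are simply dropped.

```python
from collections import Counter, defaultdict

K = 6          # toy read length (assumption); real genomes need roughly 25-35 bases
BIN_SIZE = 10  # hypothetical bin width used to summarize where reads fall


def build_index(genome, k):
    """Record every position at which each k-base snippet of the reference starts."""
    index = defaultdict(list)
    for chrom, seq in genome.items():
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].append((chrom, start))
    return index


def census(reads, index, k, bin_size):
    """Count reads per genomic bin, keeping only reads with exactly one exact match."""
    counts = Counter()
    for read in reads:
        hits = index.get(read[:k], [])
        if len(hits) == 1:  # unmapped or ambiguously mapped reads are skipped
            chrom, start = hits[0]
            counts[(chrom, start // bin_size)] += 1
    return counts


# Toy reference (made up for illustration); "TTTACG" occurs on both sequences.
genome = {
    "chrA": "AAACCCGGGTTTACGTAGCTAGGATCCA",
    "chrB": "TTTACGCATG",
}
# Toy reads: four map uniquely, one is ambiguous, one matches nothing.
reads = ["AAACCC", "AACCCG", "GTAGCT", "GTAGCT", "TTTACG", "GGGGGG"]

counts = census(reads, build_index(genome, K), K, BIN_SIZE)
for (chrom, bin_index), n in sorted(counts.items()):
    print(chrom, bin_index * BIN_SIZE, n)
```

The resulting table of counts per bin is the ‘census’: regions that contributed more molecules to the starting sample collect more reads.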
Conceptually, this builds on the ideas behind earlier methods such as serial analysis of gene expression (SAGE) and massively parallel signature sequencing (MPSS) (refs. 2,3), with the new assays being substantially less expensive, more general and capable of delivering vastly more information.

These next-generation, ‘Seq-based’ methods are a natural fit for functional genomics applications because they generate huge numbers of short sequencing reads quickly and cheaply, and, critically for their success, they all focus on a ‘reduced genome’ input. That is, they do not require sequencing an entire large genome but rather a small fraction of the total that appears as mRNA, as methylated or unmethylated fragments, as DNA or RNA bound by specific proteins, or as DNA regions that are hypersensitive to nucleases. The platforms acquire sequence data from amplified single DNA fragments rather than from fragments cloned in plasmids (see Primer). Although cost is a perpetual moving target, and there are likely to be improvements and new competing sequencers, it is now possible to do a thorough measurement of a nucleic acid profile for the same cost or less than that of using hybridization to microarrays. The Seq-based methods bypass some longstanding technical problems of microarrays, including a requirement to synthesize microarrays with millions of DNA probes that does not scale well with large genome size, considerable cross-hybridization, and difficulties with quantitation owing to the continuous nature of the hybridization signals. However, these new methods do not solve everything. For example, even though a larger fraction of the genome is accessible by Seq-based methods than by contemporary microarray hybridization, still 15–20% of the reads in the human genome cannot be unambiguously mapped to a single location because they occur more than once in the genome.

Barbara Wold is in the Division of Biology, California Institute of Technology, Pasadena, California 91125, USA.
Richard M. Myers is in the Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA. e-mail: [email protected] or [email protected]

PUBLISHED ONLINE 19 DECEMBER 2007; DOI:10.1038/NMETH1157
©2008 Nature Publishing Group http://www.nature.com/naturemethods

The sequence census application that is farthest along is chromatin immunoprecipitation, or ChIP, and several publications have appeared describing ChIP-Seq in the past few months (refs. 5–8) (Fig. 1). The goal of these experiments is to map all in vivo DNA sites occupied by a DNA-binding protein of interest. To do this, an antibody specifically recognizing a DNA-binding protein is used to immunoprecipitate the protein that has been cross-linked to its DNA-binding sites in living cells, bringing the bound DNA fragments along for the ride. The protein of interest can be part of the general