What types of genomic data sets are available?

Abstract | Integrating results from diverse experiments is an essential process in our effort to understand the logic of complex systems, such as development, homeostasis and responses to the environment. With the advent of high-throughput methods, including genome-wide association (GWA) studies, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and RNA sequencing (RNA–seq), acquisition of genome-scale data has never been easier. Epigenomics, transcriptomics, proteomics and genomics each provide an insightful, and yet one-dimensional, view of genome function; integrative analysis promises a unified, global view. However, the large amount of information and diverse technology platforms pose multiple challenges for data access and processing. This Review discusses emerging issues and strategies related to data integration in the era of next-generation genomics.

Box 1 | Collaborative projects and technology development

Why perform integrative genomic analysis?

Figure 1 | Annotating the genome through detecting transcription-factor binding sites and histone-modification states. Promoters can be mapped by the localization of general transcription machinery and transcription factors (TFs), such as RNA polymerase II (RNAPII) or transcription initiation factor TFIID-associated factor 1 (TAF1), or by the localization of histone 3 lysine 4 trimethylation (H3K4me3). The bodies of transcribed genes and non-coding RNAs are marked by H3K36me3. Enhancers can be found by distal TF binding sites or by H3K4me1. This modification often coincides with H3K4me2, which has been shown to be necessary to recruit pioneering TFs to enhancer elements (ref. 121). In addition, H3K4me1 sites overlap acetylated histone lysines, in agreement with acetylation islands outside promoters identifying functional enhancer elements (refs 122,123). Insulators are bound by CCCTC-binding factor (CTCF).
Nucleosomes are shown as cylinders and example histone tails are in green. Various TFs are shown as coloured shapes. TFs bound to the insulator include CTCF and subunits of cohesin.

Approaches to an integrative analysis

Figure 2 | Identification of regulatory SNPs. The sequence of a transcription factor (TF) binding site is shown with the position of an A/T polymorphism. By integrating chromatin signatures of enhancers or TF binding sites with SNP data, SNPs falling within the region would be predicted as regulatory SNPs. These could then be correlated to changes in gene expression. H3K4me1, histone 3 lysine 4 monomethylation.

Box 2 | Clustering

Figure 3 | Data visualization. The University of California-Santa Cruz (UCSC) Genome Browser is a tool for viewing genomic data sets. A vast amount of data is available for viewing through this browser. This example from the browser shows numerous data types in K562 cells from the ENCODE Consortium. A randomly selected gene, katanin p60 subunit A-like 1 (KATNAL1), illustrates several features that can be identified using this tool. The promoter has a typical chromatin structure (a peak of histone 3 lysine 4 trimethylation (H3K4me3) between the bimodal peaks of H3K4me1), is bound by RNA polymerase II (RNAPII) and is DNase hypersensitive. The gene is transcribed, as indicated by RNA sequencing (RNA–seq) data, as well as H3K36me3 localization. The gene lies between two CCCTC-binding factor (CTCF)-bound sites that could be tested for insulator activity. An intronic H3K4me1 peak (highlighted) predicts an enhancer element, corroborated by the DNase I hypersensitivity site peak. There is a broad repressive domain of H3K27me3 downstream, which could have an open chromatin structure in another cell type.

Using large-scale data sets for integrative analysis

Box 3 | Online tools for integrative analysis

Figure 4 | Flow chart for data analysis.
This example shows a workflow for the analysis of data from chromatin immunoprecipitation followed by sequencing (ChIP–seq). This analysis can be done by a bench scientist using current resources, and a similar strategy could be used for other types of next-generation sequencing data. Blue boxes show steps that can be performed using Galaxy. Integration or cross-sectioning of data can often be done in the University of California-Santa Cruz (UCSC) Genome Browser or by joining lists in Galaxy (purple box). Downstream steps, such as known motif analysis and Gene Ontology analysis, can be achieved with online or stand-alone tools (orange boxes). Galaxy can also be used to establish analytical pipelines for calling SNPs that could then be integrated into sequencing-based data, such as reads from ChIP–seq. CEAS, Cis-regulatory Element Annotation System; MACS, Model-based Analysis of ChIP–Seq; TSS, transcription start site.

Future perspectives

Driven by technological advances, recent years have witnessed a deluge of new methods for interrogating different properties of a cell on a genome-wide scale. Each offers a unique, although complementary, view of genome organization and cellular function. It is expected that integrating these data sets will provide more biological insights than using one data set alone. Thanks to the development of next-generation sequencing (NGS) technologies, the human genome has been mapped in many individuals; the challenge we now face is to understand this blueprint and to determine how errors lead to disease.
The traditional approach of isolating individual genes and studying them in a model system is being rapidly replaced by data sets generated by both individual laboratories and large consortia using new high-throughput technologies.

Although individual data sets, including genomic, epigenomic, transcriptomic and proteomic information, are highly informative, integrating them offers the exciting potential to answer many long-standing questions. For example, what are the functional variants of gene-distal loci identified by association studies? Where are the regulatory elements? And to what extent does the activity of regulatory elements contribute to disease phenotypes or to individual gene expression variation? Therefore, integrative analysis has become an essential part of experimental design in the era of next-generation genomics and is no longer the preserve of bioinformaticians. However, with the diversity of the high-throughput data and the seemingly endless analyses that can be performed, data integration is posing challenges for both bench
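The regulatory-SNP strategy outlined in the Figure 2 legend (intersecting chromatin signatures of enhancers or TF binding sites with SNP positions) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the Review: the function name `candidate_regulatory_snps`, the tuple layouts, the BED-style 0-based half-open coordinates, and the example SNP identifiers and positions, which are invented. In practice the same intersection could be carried out with the tools named in the Figure 4 legend, such as joining lists in Galaxy.

```python
from collections import defaultdict


def candidate_regulatory_snps(snps, regions):
    """Return SNPs that fall inside at least one annotated region.

    snps    : iterable of (snp_id, chrom, pos), pos is 0-based (assumed)
    regions : iterable of (chrom, start, end, label), half-open
              [start, end) intervals as in BED files (assumed)
    """
    # Group regions by chromosome so each SNP is only compared
    # against regions on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end, label in regions:
        by_chrom[chrom].append((start, end, label))

    hits = []
    for snp_id, chrom, pos in snps:
        # Collect the labels of every region that contains the SNP.
        labels = sorted(
            {label
             for start, end, label in by_chrom.get(chrom, [])
             if start <= pos < end}
        )
        if labels:
            hits.append((snp_id, chrom, pos, labels))
    return hits


# Hypothetical example data (invented coordinates and identifiers).
regions = [
    ("chr1", 1000, 1500, "H3K4me1"),   # enhancer-like chromatin signature
    ("chr1", 1200, 1260, "TF_site"),   # TF binding site inside that region
    ("chr2", 500, 900, "H3K4me1"),
]
snps = [
    ("snp_a", "chr1", 1210),  # inside both chr1 regions
    ("snp_b", "chr1", 2000),  # outside every region
    ("snp_c", "chr2", 899),   # last base of the chr2 region
    ("snp_d", "chr2", 900),   # just past the half-open end
]

result = candidate_regulatory_snps(snps, regions)
for hit in result:
    print(hit)
```

The linear scan over regions is chosen for readability; for genome-scale inputs a sorted index or an interval tree would be the more usual choice. SNPs returned by such a filter are only candidates, and, as the legend notes, they would still need to be correlated with changes in gene expression.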