Stanford CS 374 - Study Notes

Clustering of DNA Sequences in Human Promoters

Peter C. FitzGerald,¹ Andrey Shlyakhtenko,² Alain A. Mir,² and Charles Vinson²,³

¹Genome Analysis Unit and ²Laboratory of Metabolism, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

We have determined the distribution of each of the 65,536 DNA sequences that are eight bases long (8-mer) in a set of 13,010 human genomic promoter sequences aligned relative to the putative transcription start site (TSS). A limited number of 8-mers have peaks in their distribution (cluster), and most cluster within 100 bp of the TSS. The 156 DNA sequences exhibiting the greatest statistically significant clustering near the TSS can be placed into nine groups of related sequences. Each group is defined by a consensus sequence, and seven of these consensus sequences are known binding sites for the transcription factors (TFs) SP1, NF-Y, ETS, CREB, TBP, USF, and NRF-1. One sequence, which we named Clus1, is not a known TF binding site. The ninth sequence group is composed of the strand-specific Kozak sequence that clusters downstream of the TSS. An examination of the co-occurrence of these TF consensus sequences indicates a positive correlation for most of them except for sequences bound by TBP (the TATA box). Human mRNA expression data from 29 tissues indicate that the ETS, NRF-1, and Clus1 sequences that cluster are predominantly found in the promoters of housekeeping genes (e.g., ribosomal genes). In contrast, TATA is more abundant in the promoters of tissue-specific genes. This analysis identified eight DNA sequences in 5082 promoters that we suggest are important for regulating gene expression.

Vertebrate gene expression is often regulated by the basal promoter, which traditionally is defined as being between −200 bp and the transcription start site (TSS). The DNA sequence properties of basal promoters are poorly described because it is difficult to identify the TSS.
Two recent results have helped to resolve this problem: (1) RefSeq (Maglott et al. 2000; Pruitt et al. 2000; Pruitt and Maglott 2001) sequences have been mapped to their location in the complete human genome sequence, and (2) TSSs have been experimentally verified for 7889 genes by using cDNA synthesis methods that identify the 5′ CAP site (Suzuki et al. 2002). We have combined these data to assemble genomic DNA sequences that are putative promoter regions for 13,010 genes aligned relative to the putative TSS. We have examined these aligned sequences for 8-mers that are preferentially localized relative to the TSS, namely, clusters.

A fundamental question in gene expression studies is to determine which DNA sequences that are bound by TFs are biologically relevant. Often, the same DNA sequence is functional in one context but not in another. We reasoned that if a DNA sequence clusters relative to the TSS, the DNA sequences that are in the cluster have a high likelihood of being biologically significant. In human promoters the CAAT box, SP1, and TATA box are recognized by the constitutive transcription factors NF-Y, SP1, and TBP, respectively, and are thought to be localized near the TSS (Breathnach and Chambon 1981). Recently, a genome-wide analysis has demonstrated that the CRE sequence clusters in human promoters (Conkright et al. 2003).

To identify additional DNA sequences that localize near the TSS and thus may be biologically important, we determined the distribution of each of the 65,536 8-mer DNA sequences in 13,010 human promoter sequences from −2500 to 500 bp relative to the TSS. A detailed analysis of the 8-mers with the most significant clustering indicates that they primarily represent variations of only nine DNA consensus sequences. Eight motifs cluster between −100 and the TSS.
They include (1) TF binding sites that have been previously suggested to cluster within the promoter (CAAT, SP1, CREB, and TATA); (2) TF binding sites that were not known to localize in the core promoter region, ETS, NRF-1, and USF; and (3) a single DNA sequence, designated Clus1, that is not a known TF binding site. The ninth motif is the Kozak sequence that clusters downstream of the TSS. We observe correlations between the presence of DNA sequences that cluster in promoters and the mRNA expression properties and function of genes.

RESULTS

We combined the cDNA data for RefSeq genes (Maglott et al. 2000; Pruitt et al. 2000; Pruitt and Maglott 2001) with TSS data (Suzuki et al. 2002) and mapped the 5′ end of these cDNA sequences onto assembled human genomic DNA sequences (Lander et al. 2001) for 13,010 genes. For this study, we aligned these human promoter genomic sequences relative to the putative TSS. We then analyzed the distribution of DNA sequences (2-mers to 8-mers) between −1000 and 500 bp relative to the putative TSS and, as a control set, the region between −2500 and −1000 bp. The 1500-bp regions were divided into 75 bins. Each bin contains 20 bp, where bin 1 is from −1000 to −981 bp, and bin 51 is from 1 bp to 20 bp. We determined the number of times that a particular DNA sequence occurred in each 20-bp bin for all 13,010 promoter sequences.

Distribution of Dinucleotide Pairs in Promoters

Initially, we determined the distributions of each dinucleotide (2-mer) in the set of 13,010 human promoter sequences. We determined the position of each 2-mer on the DNA coding strand across the 1500 bp, between −1000 and 500 bp, and plotted the results as a frequency histogram. Three general distributions are observed: (1) a peak near the TSS for the 2-mers containing G and/or C (GC, CG, GG, and CC), (2) a valley near the TSS for the 2-mers containing A and/or T (AT, TA, TT, and AA), and (3) no preference for the remaining 2-mers (Fig. 1). Although peaking around the TSS, the CG sequence, which can be methylated on the C base, is the least abundant outside the promoter region, as is observed in genomic DNA (Hapgood et al. 2001).

³Corresponding author. E-MAIL [email protected]; FAX (301) 496-8419.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1953904. Article published online before print in July 2004.
Letter. Genome Research 14:1562–1574 ©2004 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/04; www.genome.org

Distribution of All 8-mer DNA Sequences

To identify DNA sequences that cluster relative to the TSS, we determined the distribution of all sequences ranging from 2-mers to 8-mers in this set of 13,010 promoter sequences. As the length of the DNA
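
The binning scheme described in the Results (75 bins of 20 bp covering −1000 to 500 bp, with bin 1 at −1000 to −981 and bin 51 at 1 to 20) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `position_to_bin` and `kmer_bin_counts` are hypothetical, and the sketch assumes that there is no position 0, that each input string covers exactly the 1500-bp window, and that a k-mer is assigned to the bin containing its first base (the text does not say how k-mers spanning a bin boundary were handled).

```python
REGION_START = -1000   # left edge of the analysed window (from the text)
REGION_END = 500       # right edge of the analysed window (from the text)
BIN_WIDTH = 20         # bp per bin (from the text)
N_BINS = 75            # 1500 bp / 20 bp (from the text)
WINDOW = N_BINS * BIN_WIDTH


def position_to_bin(pos: int) -> int:
    """Map a TSS-relative position to a 1-based bin number.

    Assumption: there is no position 0, so -1 is followed directly by +1.
    """
    if pos == 0 or not REGION_START <= pos <= REGION_END:
        raise ValueError(f"position {pos} is outside the analysed window")
    # 0-based offset from the left edge of the window.
    offset = pos - REGION_START if pos < 0 else pos - REGION_START - 1
    return offset // BIN_WIDTH + 1


def kmer_bin_counts(promoters, k):
    """Count occurrences of every k-mer in each bin across aligned sequences.

    Assumptions: every sequence is a 1500-character string covering the
    window, and a k-mer belongs to the bin that holds its first base.
    Returns a dict mapping k-mer -> list of N_BINS counts (index 0 = bin 1).
    """
    counts = {}
    for seq in promoters:
        if len(seq) != WINDOW:
            raise ValueError("each sequence must span the full 1500-bp window")
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):   # skip N and other ambiguity codes
                per_bin = counts.setdefault(kmer, [0] * N_BINS)
                per_bin[i // BIN_WIDTH] += 1
    return counts
```

Under these assumptions, `kmer_bin_counts(sequences, 2)` would give the per-bin dinucleotide counts behind a frequency histogram of the kind described above, and `k = 8` would cover the 4^8 = 65,536 possible 8-mers.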

