Berkeley STATISTICS 246 - Lecture Notes - D3039672

School name University of California, Berkeley

Course Statistics 246- Statistical Genetics

Pages 16

This preview shows page 1-2-3-4-5 out of 16 pages.


1. Methods for the discovery of cis-regulatory modules, 2
Statistics 246 Spring 2006
Week 14 Lecture 1

2. Finding motifs/modules using gene expression data
• Clustering methods
– Assume that co-expression implies co-regulation.
– Group genes into disjoint clusters based on similarity in expression profile.
– Run an algorithm such as MEME, AlignAce, MDScan, etc. to find shared motifs in the upstream regions of each group of genes.
– Then seek to cluster your motifs. We start on this today.
• Linear regression methods
– Model the gene expression (ratio) as a linear function of one or more TFBS variables (motif counts, PWM scores or more), and select the most relevant subsets. In a way these methods are variants on that in Wasserman and Fickett (1998). We come to this later.

3. Using sequence data from the organism of interest after clustering genes
The sequences are typically from promoter regions of genes believed to be co-regulated, and hence to share regulatory elements. If the organism is yeast, people typically take ~600 bp upstream of the transcription start site (TSS), whereas if it is mouse or human, 3 kb or more might be used.
There are several methods in use for identifying shared binding sites or sets of sites in unaligned sequences. Because of its importance, I'll give a quick historical overview of the Gibbs sampler approach to this problem. Programs implementing this include AlignAce and BioProspector.
In general these methods work brilliantly with bacteria, reasonably well with yeast, less well with the fly, and not very well at all with mouse and human. The issues here are the frequency of occurrence, length, degree of conservation, and proximity to the TSS of the TFBSs.
For a recent review, see Tompa et al, Nat. Biotech. 23 (1) Jan 2005 p.137.

4. The E. coli CRP dataset
Lawrence & Reilly, 1990
• 18 unaligned DNA sequences, each of length 105 bp.
• There is at least one CRP binding site, known experimentally, in each sequence.
(We met these sites last week in our discussion of the regulation of lacZ.)
• The binding sites are about 16-19 base pairs long, with considerable variability in their contents.
• We are interested in seeing if we can find and characterize these sites computationally.
• This paper started a stream of research leading to many of the methods reviewed in Tompa et al, 2005.
The figures in the next few slides are from slides or papers by Jun Liu.

5. The CRP dataset: actual sites indicated

6. A first approach to this problem
• Was made by Lawrence and Reilly, Proteins: Structure, Function and Genetics, 1990, who introduced what we call the motif alignment model (next slide). They used the EM algorithm.
• The best contemporary implementation of their method is probably in the program MEME. I'll omit further details just now.
• An alternative approach is via the Gibbs sampler, which I'll discuss in some detail.

7. Motif alignment model
[Figure: K sequences, the kth of length $N_k$, each containing one motif of width $w$ starting at position $a_k$.]
• The observed data: K sequences $b_k = (b_{k,1}, \dots, b_{k,N_k})$.
• The missing data: the alignment variable $A = \{a_1, a_2, \dots, a_K\}$ of motif start positions.
• Bases in background (= non-motif) positions are assumed to follow a common multinomial with $p_0 = (p_{0,a}, p_{0,c}, p_{0,g}, p_{0,t})$.
• Bases in position j in the motif are assumed to follow a multinomial distribution with $p_j = (p_{j,a}, p_{j,c}, p_{j,g}, p_{j,t})$, $j = 1, \dots, w$.
• All bases are assumed mutually independent.

8. A simpler problem: one sequence, one site
Suppose that the multinomial probabilities $\{p_0, p_1, \dots, p_w\}$ are all known, and that we have one observed nucleotide sequence $b = (b_1, b_2, \dots, b_N)$ known to contain one instance of the motif. Where is it? Denote its start index by $a$.
Let's suppose that our prior distribution $(\pi_n)$ for the start site $a$ is uniform on the indices $n = 1, \dots, N-w+1$, and let's update that given the observed sequence $b = (b_i)$.
That is, we calculate the posterior probability

$$
\mathrm{pr}(a = m \mid b) = \frac{\mathrm{pr}(b \mid a = m)\,\pi_m}{\sum_n \mathrm{pr}(b \mid a = n)\,\pi_n} = \frac{\mathrm{pr}(b \mid a = m)}{\sum_n \mathrm{pr}(b \mid a = n)},
$$

since $\pi$ is uniform. Now each term in the numerator and denominator is a product of p's with first subscript 0, for background, or $j = 1, \dots, w$, for position j in the motif. Divide both by

$$
\prod_{i=1}^{N} p_{0,b_i}.
$$

We find the following, which you should interpret and remember:

$$
\mathrm{pr}(a = m \mid b) = \frac{\prod_{j=1}^{w} p_{j,b_{m+j-1}} / p_{0,b_{m+j-1}}}{\sum_{n=1}^{N-w+1} \prod_{j=1}^{w} p_{j,b_{n+j-1}} / p_{0,b_{n+j-1}}} \;\propto\; \prod_{j=1}^{w} \frac{p_{j,b_{m+j-1}}}{p_{0,b_{m+j-1}}}.
$$

9. The Liu-Neuwald-Lawrence Gibbs sampler
We are going to suppose that the $p_j$ have independent Dirichlet priors $\mathrm{Dir}(p_j, \beta_j)$ with parameters $\beta_j$, $j = 0, \dots, w$.
Our data will be a set of K sequences $b_k = (b_{k,1}, \dots, b_{k,N_k})$ of bases, each assumed to have just one instance of the motif. This instance is supposed a priori to be equally likely to begin at any feasible index, i.e. we have independent uniform priors for the missing motif start indicators $A = (a_k)$.
Call the sequences Data, and the probabilities $\Theta$. The basic Gibbs sampler would involve sampling from the distribution $\mathrm{pr}(A, \Theta \mid \mathrm{Data})$, and typically this will be done by alternating between sampling $A \mid \Theta, \mathrm{Data}$ and sampling $\Theta \mid A, \mathrm{Data}$. In the present case we skip the sampling of $\Theta$, utilizing what is known in the MCMC trade as collapsing, simply sampling from $A \mid \mathrm{Data}$. This can be done sequence by sequence, and we now illustrate one step, denoting the indicator set $A$ without $a_k$ by $A_{[-k]}$.

10. The predictive update (PU) step, 1
First, note that $\mathrm{pr}(\mathrm{Data}, A \mid \Theta)$, which we call (*), is simply a product of p's for bases b = a, c, g and t, some with first subscript 0 for background, and others with first subscript j for position j in the motif, raised to the powers which correspond to their frequencies of occurrence. We are going to calculate $\mathrm{pr}(a_k = m \mid A_{[-k]}, \mathrm{Data})$, which is just the integral over $\Theta$ of the product of the Dirichlet priors by $\mathrm{pr}(a_k = m, A_{[-k]}, \mathrm{Data} \mid \Theta)$, divided by the sum over $m = 1, \dots, N_k - w + 1$ of such terms.
Now this product has exactly the same form as (*), but with different exponents, these now including the parameters $\beta_{j,b}$ from the priors. However, these integrals all end up giving ratios of products of gamma functions, so our calculation boils down to finding a simple representation of these products. We now turn to this, following Liu et al, JASA 1995.

11. The PU step, 2
The product of (*) by the Dirichlet priors has the form [...] of base b in position j = 0 (for background) and j = 1, …, w for the motif, in the sequences other than the kth, or in the kth when its motif starts at m, respectively. Integrating out the p's,
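
To make the "one sequence, one site" posterior and the sequence-by-sequence resampling idea more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the exact calculation from these slides or from Liu et al. The function names `start_posterior` and `predictive_update`, the single symmetric pseudocount `beta`, and the use of a fixed, known background distribution are all assumptions introduced here. In particular, `predictive_update` replaces the gamma-function ratios with the simpler approximation of plugging count-plus-pseudocount estimates of the motif probabilities into the single-sequence posterior.

```python
import math
import random

BASES = "acgt"

def start_posterior(seq, motif, background):
    """Posterior pr(a = m | b) under a uniform prior on the start index.

    motif: list of w dicts, motif[j][base] = probability of base at motif position j+1
    background: dict, background[base] = background probability of base
    Assumes every probability is strictly positive.
    Returns a list of length N - w + 1; entry 0 corresponds to start index m = 1.
    """
    w = len(motif)
    log_ratios = []
    for m in range(len(seq) - w + 1):
        # log of the product over the window of (motif prob / background prob)
        lr = sum(math.log(motif[j][seq[m + j]] / background[seq[m + j]])
                 for j in range(w))
        log_ratios.append(lr)
    top = max(log_ratios)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(lr - top) for lr in log_ratios]
    total = sum(weights)
    return [x / total for x in weights]

def predictive_update(seqs, starts, k, w, background, beta=0.5, rng=random):
    """Resample the start of sequence k given the starts of all other sequences.

    Hypothetical simplification: the background is held fixed, and motif
    probabilities are estimated as (count + beta) / (total + 4 * beta) from the
    currently aligned sites in the other sequences. Starts are 0-based.
    """
    counts = [{b: beta for b in BASES} for _ in range(w)]
    for i, (s, a) in enumerate(zip(seqs, starts)):
        if i == k:
            continue  # leave sequence k out
        for j in range(w):
            counts[j][s[a + j]] += 1
    motif = []
    for c in counts:
        tot = sum(c.values())
        motif.append({b: c[b] / tot for b in BASES})
    post = start_posterior(seqs[k], motif, background)
    # draw a new 0-based start index with probability proportional to post
    return rng.choices(range(len(post)), weights=post)[0]

if __name__ == "__main__":
    uniform = {b: 0.25 for b in BASES}
    seqs = ["ttacgtt", "acgtttt", "ttttacg"]
    starts = [2, 0, 0]  # the third start is deliberately wrong
    starts[2] = predictive_update(seqs, starts, k=2, w=3, background=uniform)
    print(starts)
```

Cycling `predictive_update` over k = 0, …, K-1 many times gives a simple site sampler. Working with log ratios in `start_posterior` is a design choice to avoid underflow when w is large.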
