Berkeley STATISTICS 246 - Lecture Notes



1. Methods for the discovery of cis-regulatory modules, 2
Statistics 246, Spring 2006, Week 14, Lecture 1

2. Finding motifs/modules using gene expression data
• Clustering methods
  – Assume that co-expression implies co-regulation.
  – Group genes into disjoint clusters based on similarity in expression profile.
  – Run an algorithm such as MEME, AlignACE, MDScan, etc. to find shared motifs in upstream regions of the groups of genes.
  – Then seek to cluster your motifs. We start on this today.
• Linear regression methods
  – Model the gene expression (ratio) as a linear function of one or more TFBS variables (motif counts, PWM scores or more), and select the most relevant subsets. In a way these methods are variants on that in Wasserman and Fickett (1998). We come to this later.

3. Using sequence data from the organism of interest after clustering genes
The sequences are typically from promoter regions of genes believed to be co-regulated, and hence to share regulatory elements. If the organism is yeast, people typically take ~600 bp upstream of the transcription start site, whereas if it is mouse or human, 3 kb or more might be used.
There are several methods in use for identifying shared binding sites of sets of sites in unaligned sequences. Because of its importance, I'll give a quick historical overview of the Gibbs sampler approach to this problem. Programs implementing this include AlignACE and BioProspector.
In general these methods work brilliantly with bacteria, reasonably well with yeast, less well with the fly, and not very well at all with mouse and human. The issues here are the frequency of occurrence, length, degree of conservation, and proximity to the TSS of the TFBSs.
For a recent review, see Tompa et al, Nat. Biotech. 23 (1) Jan 2005 p.137.

4. The E. coli CRP dataset
Lawrence & Reilly, 1990
• 18 unaligned DNA sequences, each of length 105 bp.
• There is at least one CRP binding site, known experimentally, in each sequence.
  (We met these sites last week in our discussion of the regulation of lacZ.)
• The binding sites are about 16-19 base pairs long, with considerable variability in their contents.
• We are interested in seeing if we can find and characterize these sites computationally.
• This paper started a stream of research leading to many of the methods reviewed in Tompa et al, 2005.
The figures in the next few slides are from slides or papers by Jun Liu.

5. The CRP dataset: actual sites indicated

6. A first approach to this problem
• Was made by Lawrence and Reilly, Proteins: Structure, Function and Genetics, 1990, who introduced what we call the motif alignment model (next slide). They used the EM algorithm.
• The best contemporary implementation of their method is probably in the program MEME. I'll omit further details just now.
• An alternative approach is via the Gibbs sampler, which I'll discuss in some detail.

7. Motif alignment model
[Figure: K sequences, the kth of length $N_k$, each containing one motif of width $w$ starting at positions $a_1, a_2, \ldots, a_K$.]
• The observed data: K sequences $b_k = (b_{k,1}, \ldots, b_{k,N_k})$.
• The missing data: the alignment variable $A = \{a_1, a_2, \ldots, a_K\}$ of motif start positions.
• Bases in background (= non-motif) positions are assumed to follow a common multinomial with $p_0 = (p_{0,a}, p_{0,c}, p_{0,g}, p_{0,t})$.
• Bases in position $j$ in the motif are assumed to follow a multinomial distribution with $p_j = (p_{j,a}, p_{j,c}, p_{j,g}, p_{j,t})$, $j = 1, \ldots, w$.
• All bases are assumed mutually independent.

8. A simpler problem: one sequence, one site
Suppose that the multinomial probabilities $\{p_0, p_1, \ldots, p_w\}$ are all known, and that we have one observed nucleotide sequence $b = (b_1, b_2, \ldots, b_N)$ known to contain one instance of the motif. Where is it? Denote its start index by $a$.
Let's suppose that our prior distribution $(\pi_n)$ for the start site $a$ is uniform on the indices $n = 1, \ldots, N-w+1$, and let's update that given the observed sequence $b = (b_i)$.
That is, we calculate the posterior probability

$$\mathrm{pr}(a = m \mid b) = \frac{\mathrm{pr}(b \mid a = m)\,\pi_m}{\sum_n \mathrm{pr}(b \mid a = n)\,\pi_n} = \frac{\mathrm{pr}(b \mid a = m)}{\sum_n \mathrm{pr}(b \mid a = n)},$$

since $\pi$ is uniform. Now each term in the numerator and denominator is a product of p's with first subscript 0, for background, or $j = 1, \ldots, w$, for position $j$ in the motif. Divide both by $\prod_{i=1}^{N} p_{0,b_i}$. We find the following, which you should interpret and remember:

$$\mathrm{pr}(a = m \mid b) = \prod_{j=1}^{w} \frac{p_{j,b_{m+j-1}}}{p_{0,b_{m+j-1}}} \Bigg/ \sum_{n=1}^{N-w+1} \prod_{j=1}^{w} \frac{p_{j,b_{n+j-1}}}{p_{0,b_{n+j-1}}} \;\propto\; \prod_{j=1}^{w} \frac{p_{j,b_{m+j-1}}}{p_{0,b_{m+j-1}}}.$$

9. The Liu-Neuwald-Lawrence Gibbs sampler
We are going to suppose that the $p_j$ have independent Dirichlet priors $\mathrm{Dir}(p_j, \beta_j)$ with parameters $\beta_j$, $j = 0, \ldots, w$.
Our data will be a set of K sequences $b_k = (b_{k,1}, \ldots, b_{k,N_k})$ of bases, each assumed to have just one instance of the motif. This instance is supposed a priori to be equally likely to begin at any feasible index, i.e. we have independent uniform priors for the missing motif start indicators $A = (a_k)$.
Call the sequences Data, and the probabilities $\Theta$. The basic Gibbs sampler would involve sampling from the distribution $\mathrm{pr}(A, \Theta \mid \mathrm{Data})$, and typically this will be done by alternating between sampling $A \mid \Theta, \mathrm{Data}$ and sampling $\Theta \mid A, \mathrm{Data}$. In the present case we skip the sampling of $\Theta$, utilizing what is known in the MCMC trade as collapsing, simply sampling from $A \mid \mathrm{Data}$. This can be done sequence by sequence, and we now illustrate one step, denoting the indicator set $A$ without $a_k$ by $A_{[-k]}$.

10. The predictive update (PU) step, 1
First, note that $\mathrm{pr}(\mathrm{Data}, A \mid \Theta)$, which we call (*), is simply a product of p's for bases $b$ = a, c, g and t, some with first subscript 0 for background, and others with first subscript $j$ for position $j$ in the motif, raised to the powers which correspond to their frequencies of occurrence. We are going to calculate $\mathrm{pr}(a_k = m \mid A_{[-k]}, \mathrm{Data})$, which is just the integral of the product of the Dirichlet priors by $\mathrm{pr}(a_k = m, A_{[-k]}, \mathrm{Data} \mid \Theta)$, divided by the sum from $m = 1$ to $N_k - w + 1$ of such terms.
Now this product has exactly the same form as (*), but with different exponents, these now including the parameters $\beta_{j,b}$ from the priors. However, these integrals all end up giving ratios of products of gamma functions, so our calculation boils down to finding a simple representation of these products. We now turn to this, following Liu et al, JASA 1995.

11. The PU step, 2
The product of (*) by the Dirichlet priors has the form

$$\prod_{j=0}^{w} \prod_{b} p_{j,b}^{\,c^{[-k]}_{j,b} + c^{(k)}_{j,b}(m) + \beta_{j,b} - 1},$$

where $c^{[-k]}_{j,b}$ and $c^{(k)}_{j,b}(m)$ are the counts of base $b$ in position $j = 0$ (for background) and $j = 1, \ldots, w$ for the motif, in the sequences other than the $k$th, or in the $k$th when its motif starts at $m$, respectively. Integrating out the p's,
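
As a hedged illustration of slides 8 to 10, the following Python sketch computes the one-sequence posterior pr(a = m | b) from the likelihood-ratio product, and then one collapsed conditional pr(a_k = m | A[-k], Data). It is a sketch under stated assumptions, not the implementation from Liu et al: sequences are uppercase A/C/G/T strings, start indices are 0-based, every p_j gets the same symmetric Dirichlet prior with a single parameter `beta` (the notes allow a separate β_j per position), and all function names are hypothetical. The conditional is evaluated by brute force from the log-gamma form of the Dirichlet-multinomial integral, which is simple to check but far slower than a simplified count-based update would be.

```python
import math
import random

BASES = "ACGT"


def normalize_logs(logs):
    """Turn unnormalized log-weights into probabilities (stable softmax)."""
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [x / total for x in weights]


def site_posterior(seq, p_motif, p_bg):
    """pr(a = m | b) for one sequence with one site and known p's.

    p_motif[j] maps base -> probability at motif position j+1;
    p_bg maps base -> background probability p0. Starts are 0-based.
    The uniform prior over feasible starts cancels in the normalization.
    """
    w = len(p_motif)
    logs = [
        sum(math.log(p_motif[j][seq[m + j]] / p_bg[seq[m + j]]) for j in range(w))
        for m in range(len(seq) - w + 1)
    ]
    return normalize_logs(logs)


def log_marginal(seqs, starts, w, beta):
    """log pr(Data, A) up to an additive constant, with the p's integrated out.

    Assumption: a symmetric Dirichlet(beta, beta, beta, beta) prior on every p_j.
    """
    counts = [dict.fromkeys(BASES, 0) for _ in range(w + 1)]  # row 0: background
    for seq, a in zip(seqs, starts):
        for i, base in enumerate(seq):
            j = i - a + 1 if a <= i < a + w else 0  # motif position, or 0
            counts[j][base] += 1
    total = 0.0
    for row in counts:
        total += sum(math.lgamma(row[b] + beta) for b in BASES)
        total -= math.lgamma(sum(row.values()) + 4 * beta)
    return total


def start_conditional(seqs, starts, k, w, beta=1.0):
    """pr(a_k = m | A[-k], Data) for every feasible 0-based start m."""
    starts = list(starts)
    logs = []
    for m in range(len(seqs[k]) - w + 1):
        trial = starts[:k] + [m] + starts[k + 1:]
        logs.append(log_marginal(seqs, trial, w, beta))
    return normalize_logs(logs)


def gibbs_sweep(seqs, starts, w, beta=1.0, rng=random):
    """Resample each a_k in turn from its conditional; returns the new starts."""
    starts = list(starts)
    for k in range(len(seqs)):
        probs = start_conditional(seqs, starts, k, w, beta)
        starts[k] = rng.choices(range(len(probs)), weights=probs)[0]
    return starts
```

Calling `gibbs_sweep` repeatedly, feeding each result back in as `starts`, gives a chain over alignments A. Because each conditional recomputes all counts from scratch, this is only practical for small toy inputs.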

