Methods for the discovery of cis-regulatory modules 2
Statistics 246, Week 14, Spring 2006, Lecture 1

Finding motifs/modules using gene expression data

Clustering methods. Assume that co-expression implies co-regulation. Group genes into disjoint clusters based on similarity in expression profile. Run an algorithm such as MEME, AlignAce, MDScan, etc. to find shared motifs in the upstream regions of each group of genes. Then seek to cluster your motifs. We start on this today.

Linear regression methods. Model the gene expression ratio as a linear function of one or more TFBS variables (motif counts, PWM scores, or more) and select the most relevant subsets. In a way, these methods are variants on that in Wasserman and Fickett (1998). We come to this later.

Using sequence data from the organism of interest (after clustering genes)

The sequences are typically from promoter regions of genes believed to be co-regulated, and hence to share regulatory elements. If the organism is yeast, people typically take 600 bp upstream of the transcription start site (TSS), whereas if it is mouse or human, 3 kb or more might be used.

There are several methods in use for identifying shared binding sites, or sets of sites, in unaligned sequences. Because of its importance, I'll give a quick historical overview of the Gibbs sampler approach to this problem. Programs implementing this include AlignAce and BioProspector.

In general these methods work brilliantly with bacteria, reasonably well with yeast, less well with the fly, and not very well at all with mouse and human. The issue here is the frequency of occurrence, length and degree of conservation, and proximity to the TSS of TFBSs.

For a recent review see Tompa et al., Nat Biotech 23(1), Jan 2005, p. 137.

The E. coli CRP dataset (Lawrence & Reilly 1990)

18 unaligned DNA sequences, each of length 105 bp. There is at least one CRP binding site, known experimentally, in each sequence. We met these sites last week in our discussion of the regulation of lacZ. The binding sites are about 16-19 base pairs
long, with considerable variability in their contents. We are interested in seeing if we can find and characterize these sites computationally. This paper started a stream of research leading to many of the methods reviewed in Tompa et al. 2005.

The figures in the next few slides are from slides or papers by Jun Liu.

The CRP dataset, actual sites indicated

A first approach to this problem

This was made by Lawrence and Reilly (Proteins: Structure, Function, and Genetics, 1990), who introduced what we call the motif alignment model (next slide). They used the EM algorithm. The best contemporary implementation of their method is probably in the program MEME. I'll omit further details just now.

An alternative approach is via the Gibbs sampler, which I'll discuss in some detail.

Motif alignment model

[Figure: sequence k has length $N_k$ and contains one motif of width $w$ starting at position $a_k$.]

The observed data: K sequences $b_k = (b_{k,1}, \dots, b_{k,N_k})$, $k = 1, \dots, K$.

The missing data: the alignment variable $A = (a_1, a_2, \dots, a_K)$ of motif start positions.

Bases in background (non-motif) positions are assumed to follow a common multinomial with $p_0 = (p_{0,a}, p_{0,c}, p_{0,g}, p_{0,t})$.

Bases in position j in the motif are assumed to follow a multinomial distribution with $p_j = (p_{j,a}, p_{j,c}, p_{j,g}, p_{j,t})$, $j = 1, \dots, w$.

All bases are assumed mutually independent.

A simpler problem: one sequence, one site

Suppose that the multinomial probabilities $p_0, p_1, \dots, p_w$ are all known, and that we have one observed nucleotide sequence $b = (b_1, b_2, \dots, b_N)$ known to contain one instance of the motif. Where is it? Denote its start index by $a$.

Let's suppose that our prior distribution $\pi_n$ for the start site $a$ is uniform on the indices $n = 1, \dots, N-w+1$, and let's update that given the observed sequence $b = (b_i)$. That is, we calculate the posterior probability

\[
\mathrm{pr}(a = m \mid b)
= \frac{\mathrm{pr}(b \mid a = m)\,\pi_m}{\sum_{n} \mathrm{pr}(b \mid a = n)\,\pi_n}
= \frac{\mathrm{pr}(b \mid a = m)}{\sum_{n} \mathrm{pr}(b \mid a = n)},
\]

since $\pi$ is uniform. Now each term in the numerator and denominator is a product of p's, with first subscript 0 for background or $j = 1, \dots, w$ for position j in the motif. Divide both by $\prod_{i=1}^{N} p_{0,b_i}$. We find the following, which you should interpret and remember:

\[
\mathrm{pr}(a = m \mid b)
= \frac{\prod_{j=1}^{w} p_{j,b_{m+j-1}} / p_{0,b_{m+j-1}}}
       {\sum_{n=1}^{N-w+1} \prod_{j=1}^{w} p_{j,b_{n+j-1}} / p_{0,b_{n+j-1}}}
\;\propto\; \prod_{j=1}^{w} \frac{p_{j,b_{m+j-1}}}{p_{0,b_{m+j-1}}}.
\]

The Liu-Neuwald-Lawrence Gibbs sampler

We are going to suppose that the $p_j$ have independent Dirichlet priors $\mathrm{Dir}(p_j \mid \alpha_j)$ with parameters $\alpha_j$, $j = 0, \dots, w$.

Our data will be a set of K sequences $b_k = (b_{k,1}, \dots, b_{k,N_k})$ of bases, each assumed to have just one instance of the motif. This instance is supposed a priori to be equally likely to begin at any feasible index, i.e. we have independent uniform priors for the missing motif start indicators $A = (a_k)$.

Call the sequences Data and the probabilities $\Theta$. The basic Gibbs sampler would involve sampling from the distribution $\mathrm{pr}(A, \Theta \mid \text{Data})$, and typically this will be done by alternating between sampling $A \mid \Theta, \text{Data}$ and sampling $\Theta \mid A, \text{Data}$. In the present case we skip the sampling of $\Theta$, utilizing what is known in the MCMC trade as collapsing: simply sampling from $A \mid \text{Data}$. This can be done sequence by sequence, and we now illustrate one step, denoting the indicator set $A$ without $a_k$ by $A_{[-k]}$.

The predictive update (PU) step, 1

First note that $\mathrm{pr}(\text{Data} \mid A, \Theta)$, which we call (*), is simply a product of p's for bases $b = a, c, g$ and $t$, some with first subscript 0 for background and others with first subscripts j for position j in the motif, raised to the powers which correspond to their frequencies of occurrence.

We are going to calculate $\mathrm{pr}(a_k = m \mid A_{[-k]}, \text{Data})$, which is just the integral over $\Theta$ of the product of the Dirichlet priors by $\mathrm{pr}(\text{Data} \mid a_k = m, A_{[-k]}, \Theta)$, divided by the sum over $m = 1, \dots, N_k - w + 1$ of such terms. Now this product has exactly the same form as (*), but with different exponents, these now including the parameters $\alpha_{j,b}$ from the priors.

However, these integrals all end up giving ratios of products of gamma functions, so our calculation boils down to finding a simple representation of these products. We now turn to this, following Liu et al., JASA 1995.

The PU step, 2

The product of (*) by the Dirichlet priors has the form

\[
\prod_{j=0}^{w} \; \prod_{b \in \{a,c,g,t\}} p_{j,b}^{\,C^{[-k]}_{j,b} + C^{k,m}_{j,b} + \alpha_{j,b} - 1},
\]

where the $C^{[-k]}_{j,b}$ are the counts of base b in position j ($j = 0$ for background and $j = 1, \dots, w$ for the motif) in the sequences other than the kth …
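
As a concrete illustration of the earlier "one sequence, one site" calculation, here is a minimal Python sketch of the posterior over start positions under a uniform prior. It is only a sketch under stated assumptions: the function name `site_posterior`, the base ordering a, c, g, t, and the toy probabilities are all hypothetical choices made for illustration, not part of the lecture, and all probabilities are assumed to be strictly positive.

```python
import math

BASES = "acgt"  # assumed column order for every probability vector


def site_posterior(seq, p_motif, p_background):
    """Posterior over the motif start index (0-based) under a uniform prior.

    seq          : string over a/c/g/t of length N
    p_motif      : list of w rows; row j-1 holds the motif probabilities p_j
    p_background : list of 4 background probabilities p_0
    All probabilities are assumed strictly positive.
    """
    idx = [BASES.index(ch) for ch in seq.lower()]
    w = len(p_motif)
    n_starts = len(idx) - w + 1
    if n_starts < 1:
        raise ValueError("sequence is shorter than the motif")

    # Log of the product over j of p_j(b) / p_0(b) for each candidate start m.
    log_scores = []
    for m in range(n_starts):
        s = 0.0
        for j in range(w):
            b = idx[m + j]
            s += math.log(p_motif[j][b]) - math.log(p_background[b])
        log_scores.append(s)

    # Normalise in log space to avoid underflow for long motifs.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [x / total for x in weights]


# Toy example: a width-3 motif that strongly prefers "acg".
p0 = [0.25, 0.25, 0.25, 0.25]
pwm = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
]
posterior = site_posterior("ttacgtt", pwm, p0)
best_start = max(range(len(posterior)), key=posterior.__getitem__)
print("most probable start (1-based):", best_start + 1)
```

Working with log likelihood ratios rather than raw products is a design choice for numerical stability. It does not change the posterior, since the normalising sum cancels any common factor.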