Linear models to discover cis-regulatory modules

Statistics 246, Spring 2006, Week 16, Lecture 1

Finding motifs/modules using gene expression data

Linear regression methods: model log gene expression (or log ratios of gene expression) from microarray data as a linear function of one or more known or potential TFBS variables based on the gene's regulatory region, e.g. motif counts, PWM scores, ChIP-chip quantities, or transformations thereof.

The idea is to select the best individual variables or sets of variables, and hope that these correspond to motifs of cis-regulatory modules for the gene expression patterns exhibited in the microarray data.

These methods may be seen as variants or extensions of the logistic modelling in Wasserman and Fickett (1998), discussed in Week 13, Lecture 2.

A beginning: simple linear models

Bussemaker et al. (2001) suggested the following idea. For gene $g$, let $y_g$ be the absolute or relative mRNA expression level (log scale) from a microarray experiment, and let $N_g$ be the number of instances of a known or possible motif in its cis-regulatory region, suitably defined. They fitted linear models of the form

$$y_g = d + f N_g + \epsilon_g,$$

where $d$ represents a baseline expression level when no motifs are present in the region, $f$ is the increase or decrease in expression level per motif instance, and $\epsilon_g$ is the residual error term for gene $g$.

In practice, we would consider the significance of estimated coefficients, and deal in some way with the multiple testing. Here I will not discuss these issues.

Simple linear (cont.)

Of course, Bussemaker et al. (2001) didn't stop with just one motif variable $N_g$ contributing to a single gene's log relative expression: they went on to consider multiple regression models with several motif counts, these having the form

$$y_g = d + \sum_{m \in M_g} f_m N_{mg} + \epsilon_g.$$

Here $N_{mg}$ is the count for motif $m$ in the regulatory region of gene $g$, and $M_g$ is the set of motifs contributing to its regulation. This is where we might find CRMs. In what follows, please bear in mind this natural next step, as I won't be
discussing that explicitly either.

Alternatives

Conlon et al. (2002) used PWM scores rather than the motif counts mentioned by Bussemaker et al. (2001). This is of course only possible for known motifs.

Keles et al. (2004) used logic regression.

Das et al. (2004) used multivariate spline models.

Gao et al. (2004) used ChIP occupancy log ratios, not sequence information, as predictors of expression.

We now look briefly at each of these before returning to the simplest model. I'll omit the details, e.g. of how a candidate set of motifs for a given gene was obtained, how models were selected, etc.

Alternatives (1)

More fully, Conlon et al. (2002) used the PWM score

$$S_{mg} = \log_2 \sum_{x \in X_{wg}} \frac{\Pr(x \mid \theta_m)}{\Pr(x \mid \theta_0)}$$

in the model

$$y_g = d_m + f_m S_{mg} + \epsilon_g,$$

where $\theta_m$ is the probability matrix of a motif $m$ of width $w$, $\theta_0$ is the 3rd-order Markov model estimated from all of the intergenic sequences, and $X_{wg}$ is the set of all $w$-mers in a suitable upstream sequence of gene $g$, e.g. 600 or 3,000 bp.

Comment: in my view the $\log$ and the $\sum$ above should probably be interchanged, and $\max$ taken instead of $\sum$.

Alternatives (2)

Keles et al. (2004) use logic regression, i.e. they use Boolean expressions as covariates. Let us write $I(A)$ for the indicator function of $A$, where $I(A) = 1$ when $A$ is true and $I(A) = 0$ otherwise, and where $A$ is a statement of the form "there is a binding site for TF $t$ in the regulatory region of a given gene $g$".

Examples of logic trees (drawn as diagrams on the original slide) correspond to the expressions

$$I[A \text{ or } (B \text{ and } C)] \qquad \text{and} \qquad I[(A \text{ or } B) \text{ and } (C^c \text{ and } D)].$$

Alternatives (3)

Das et al. (2004) use the program MARS (multivariate adaptive regression splines), which builds the linear model in terms of linear splines and their products. If we define $(x)_+$ as follows,

$$(x)_+ = \begin{cases} x & \text{if } x > 0, \\ 0 & \text{otherwise,} \end{cases}$$

then MARS builds up from the following components.

[Plots of the component functions against $n$, with axis values 0, 1, 2; not reproduced here.]

Alternatives (4)

Gao et al. (2004) used the ChIP occupancy log ratios as predictors for expression. Their linear model for the log expression ratio $y_{ga}$ of gene $g$ in microarray $a$ is

$$y_{ga} = d_a + \sum_t f_{ta} B_{tg} + \epsilon_{ga},$$

where $B_{tg}$ is the ChIP log ratio in the promoter region of gene $g$ obtained from the ChIP
experiment for TF $t$. These authors used $a = 1, \dots, 751$ microarray experiments and $t = 1, \dots, 113$ transcription factors.

Comment 1

The first three extensions were developed for two reasons: (a) the unrealism of the simple additive model, and (b) the need to consider information on multiple interacting TFBSs, i.e. to include interactions in the model.

From the statistical perspective, this is a potentially massive model selection problem, depending on how many sites or candidate TFBSs are involved. Up for discussion are the model classes (linear terms, Boolean expressions, products of linear splines, etc.), the model selection method (search algorithm, model comparison criterion), and the method for determining which terms in the model are to be taken seriously (significance, cross-validation).

There is a lot of room for more work here, including data analysis. We should ask: what do the data look like?

Comment 2

Another point is this. With the exception of the paper by Gao et al. (2004), authors in this field have so far studied only single microarrays (or perhaps pairs, for relative expression), but not whole experiments.

In some cases this might be biologically reasonable, as the role of a motif may be quite specific, e.g. to a time or a component of an experiment. But frequently it will be plausible that more complex functions of the microarray measurements will be indicative of regulation of a gene by a TF across an entire experiment. We'll see examples. Nancy will continue this theme on Thursday.

Spellman's yeast cell cycle experiment

[Cell cycle diagram.] M = mitotic phase; $G_i$ = gaps, $i = 1, 2$; S = DNA synthesis.

Spellman's microarray data

They span 2 cell cycles: 18 time points in the interval 0 to 119 minutes, at 7-minute intervals.

We'll discuss the alpha arrest experiment, where the mRNA labelled red (Cy5) is from cells synchronized by adding the alpha factor, while that labelled green (Cy3) is from unsynchronized cells. The log expression ratio $y_{gt}$ for gene $g$ at time $t$ is thus

$$y_{gt} = \log_2 \frac{\text{Synchronized}_t}{\text{Asynchronous}_t}.$$

Let's look at some plots of these data, and
then turn to motifs and plots of gene expression vs. motif counts.

Caption to preceding: Figure 1 from Spellman et al. (1998)

Figure 1. Gene expression during the yeast cell cycle. Genes correspond to the rows, and the time points of each experiment are the columns. The ratio of induction/repression is shown for each gene such that the magnitude is indicated by the intensity …
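
As a concrete illustration of the simple linear model of Bussemaker et al. (2001), $y_g = d + f N_g + \epsilon_g$, here is a minimal Python sketch of the least-squares fit. It is not the authors' code. The data are synthetic, and the function name `fit_simple_linear`, the "true" values of $d$ and $f$, and the noise level are assumptions made for the example.

```python
import random


def fit_simple_linear(counts, expr):
    """Ordinary least-squares fit of expr = d + f * counts + error.

    counts: motif counts N_g, one per gene
    expr:   log expression values y_g, one per gene
    Returns the estimates (d, f).
    """
    n = len(counts)
    mean_n = sum(counts) / n
    mean_y = sum(expr) / n
    # Centered sums of squares and cross-products.
    sxx = sum((c - mean_n) ** 2 for c in counts)
    sxy = sum((c - mean_n) * (y - mean_y) for c, y in zip(counts, expr))
    f = sxy / sxx              # change in log expression per motif instance
    d = mean_y - f * mean_n    # baseline level when no motifs are present
    return d, f


# Synthetic data (hypothetical): 500 genes, 0 to 4 motif instances each.
rng = random.Random(0)
true_d, true_f = -0.2, 0.5
counts = [rng.randint(0, 4) for _ in range(500)]
expr = [true_d + true_f * c + rng.gauss(0.0, 0.3) for c in counts]

d_hat, f_hat = fit_simple_linear(counts, expr)
print(f"estimated d = {d_hat:.3f}, estimated f = {f_hat:.3f}")
```

In a real analysis one would also look at the standard error of the estimated $f$ and, with many candidate motifs, correct for multiple testing, as noted in the lecture.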
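
The PWM score attributed to Conlon et al. in the lecture, and the lecturer's suggested variant (take the log inside and replace the sum by a max), can be compared on a toy sequence. The sketch below is only illustrative. It assumes a zeroth-order background of independent base frequencies in place of the 3rd-order Markov model described in the lecture, and the motif matrix, the sequence, and the function names are all hypothetical.

```python
import math


def wmer_ratio(wmer, pwm, background):
    """Likelihood ratio Pr(x | motif) / Pr(x | background) for one w-mer.

    ASSUMPTION: the background treats bases as independent (zeroth order),
    a simplification of the 3rd-order Markov model in the lecture.
    """
    ratio = 1.0
    for pos, base in enumerate(wmer):
        ratio *= pwm[pos][base] / background[base]
    return ratio


def all_ratios(seq, pwm, background):
    """Likelihood ratios for every w-mer in seq, where w is the motif width."""
    w = len(pwm)
    return [wmer_ratio(seq[i:i + w], pwm, background)
            for i in range(len(seq) - w + 1)]


def sum_score(seq, pwm, background):
    """log2 of the sum of likelihood ratios over all w-mers."""
    return math.log2(sum(all_ratios(seq, pwm, background)))


def max_score(seq, pwm, background):
    """Maximum over w-mers of the log2 likelihood ratio."""
    return max(math.log2(r) for r in all_ratios(seq, pwm, background))


def column(preferred):
    """One PWM column: 0.85 on the preferred base, 0.05 on each other base."""
    return {b: (0.85 if b == preferred else 0.05) for b in "ACGT"}


# Hypothetical width-3 motif favouring "TGA", with a uniform background.
pwm = [column("T"), column("G"), column("A")]
background = {b: 0.25 for b in "ACGT"}
seq = "ACTGAC"

print("sum score:", sum_score(seq, pwm, background))
print("max score:", max_score(seq, pwm, background))
```

When one w-mer matches the motif far better than the rest, as here, the two scores are close. They diverge when a region contains many weak matches, since the sum accumulates them and the max ignores all but the best.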