Berkeley STATISTICS 246 - Linear models to discover cis-regulatory modules - D2661451


School name University of California, Berkeley

Course Statistics 246- Statistical Genetics

Pages 63


Linear models to discover cis-regulatory modules
Statistics 246, Spring 2006
Week 16, Lecture 1

Finding motifs/modules using gene expression data

Linear regression methods

Model log gene expression (or log ratios of gene expression) from microarray data as a linear function of one or more known or potential TFBS variables based on the gene’s regulatory region (e.g. motif counts, PWM scores, ChIP-chip quantities, or transformations thereof). The idea is to select the best individual variables or sets of variables, and hope that these correspond to motifs of cis-regulatory modules for the gene expression patterns exhibited in the microarray data.

These methods may be seen as variants or extensions of the logistic modelling in Wasserman and Fickett (1998) discussed in Week 13 Lecture 2.

A beginning: simple linear models

Bussemaker et al (2001) suggested the following idea. For gene g, let $y_g$ be the absolute or relative mRNA expression level (log scale) from a microarray experiment, and let $N_g$ be the number of instances of a known or possible motif in its cis-regulatory region (suitably defined). They fitted linear models of the form

$$y_g = d + f \times N_g + \varepsilon_g,$$

where d represents a baseline expression level when no motifs are present in the region, f is the increase or decrease in expression level per motif instance, and $\varepsilon_g$ is the residual “error” term for gene g. In practice, we would consider the significance of estimated coefficients, and deal in some way with the multiple testing. Here I will not discuss these issues.

Simple linear, cont.

Of course Bussemaker et al (2001) didn’t stop with just one motif variable $N_g$ contributing to a single gene’s log (relative) expression; they went on to consider multiple regression models with several motif counts, in which each motif m in a set $M_g$ contributes its own term $f_m \cdot N_{mg}$. Here $N_{mg}$ is the count for motif m in the regulatory region of gene g, and $M_g$ is the set of motifs contributing to its regulation. This is where we might find CRMs.
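
The single-motif and multiple-motif models above are ordinary linear regressions of log expression on motif counts. The following is a minimal sketch of such a fit, assuming plain least squares via NumPy; the function name `fit_motif_model`, the simulated counts and the "true" coefficients are all made up for illustration and are not taken from Bussemaker et al (2001).

```python
import numpy as np

def fit_motif_model(y, N):
    """Least-squares fit of y_g = d + sum_m f_m * N_mg + eps_g.

    y : (G,) log expression (or log ratio), one value per gene.
    N : (G,) counts for one motif, or (G, M) counts for M motifs.
    Returns (d, f), where f has one coefficient per motif.
    """
    y = np.asarray(y, dtype=float)
    N = np.asarray(N, dtype=float)
    if N.ndim == 1:                      # single-motif model
        N = N[:, None]
    X = np.column_stack([np.ones(len(y)), N])   # intercept column for d
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]

# Hypothetical toy data: 500 genes, 2 motifs, invented effects.
rng = np.random.default_rng(0)
G = 500
N = rng.poisson(1.0, size=(G, 2))
y = 0.2 + N @ np.array([0.5, -0.3]) + rng.normal(0.0, 0.1, size=G)

d_hat, f_hat = fit_motif_model(y, N)        # multiple regression
d_one, f_one = fit_motif_model(y, N[:, 0])  # one motif only
```

Fitting the one-motif model to data generated from two motifs leaves the second motif's effect in the intercept and residuals, which is one reason the multiple regression form is the natural next step.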
In what follows, please bear in mind this natural next step, as I won’t be discussing it explicitly either. Written out, the multiple regression model is

$$y_g = d + \sum_{m \in M_g} f_m \cdot N_{mg} + \varepsilon_g.$$

Alternatives

• Conlon et al (2002) used PWM scores rather than motif counts [mentioned by Bussemaker et al (2001)]. This is of course only possible for known motifs.
• Keles et al (2004) used logic regression.
• Das et al (2004) used multivariate spline models.
• Gao et al (2004) used ChIP occupancy log-ratios, not sequence information, as predictors of expression.

We now look briefly at each of these, before returning to the simplest model. I’ll omit the details, e.g. of how a candidate set of motifs for a given gene was obtained, how models were selected, etc.

Alternatives, 1

More fully, Conlon et al (2002) used the PWM score

$$S_{mg} = \log_2 \left[ \sum_{x \in X_{wg}} \frac{\Pr(x \mid \theta)}{\Pr(x \mid \theta_0)} \right]$$

in

$$y_g = d_m + f_m \times S_{mg} + \varepsilon_g,$$

where $\theta$ is the probability matrix of a motif m of width w; $\theta_0$ is the 3rd-order Markov model estimated from all of the intergenic sequences; and $X_{wg}$ is the set of all w-mers in a suitable upstream sequence of gene g, e.g. 600 or 3,000 bp.

Comment: in my view the log and the ∑ above should probably be interchanged, and max taken instead of ∑.

Alternatives, 2

Keles et al (2004) use logic regression, i.e. they use Boolean expressions as covariates. Let us write I(A) for the indicator function of A, where I(A) = 1 when A is true and I(A) = 0 otherwise, and where A is a statement of the form “there is a binding site for TF t in the regulatory region of a given gene g.” Examples of logic trees are below.

[Figure: two logic trees, representing I((A or B) and (C^c and D)) and I(A or (B and C)), where C^c is the complement of C.]

Alternatives, 3

Das et al (2004) use the program MARS (multivariate adaptive regression splines), which builds the linear model in terms of linear splines and their products. If we define $\theta(x, 0)$ as follows:

$$\theta(x, 0) = \begin{cases} x, & \text{if } x \ge 0, \\ 0, & \text{o.w.,} \end{cases}$$

then MARS builds up from the following components: $\theta(n - \xi_1, 0)$ and $\theta(\xi_2 - n, 0)$.

[Figure: the two components plotted against n, with knots at $\xi_1$ and $\xi_2$.]

Alternatives, 4

Gao et al (2004) used the ChIP occupancy log-ratios as predictors for expression.
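
Using ChIP log-ratios as predictors amounts to regressing each array's expression values on a gene-by-TF matrix of ChIP measurements, with a separate intercept and separate coefficients for each array. Here is a minimal sketch of that setup, assuming ordinary least squares and fully simulated data; the function name `fit_chip_regressions`, the toy dimensions and the fitting method are illustrative assumptions, not details taken from Gao et al (2004).

```python
import numpy as np

def fit_chip_regressions(Y, B):
    """Fit one least-squares regression per array.

    Y : (G, A) log expression (ratios), genes by arrays.
    B : (G, T) ChIP log-ratios, genes by transcription factors.
    Returns d of shape (A,) and F of shape (T, A), where column a
    of F holds the TF coefficients for array a.
    """
    Y = np.asarray(Y, dtype=float)
    B = np.asarray(B, dtype=float)
    X = np.column_stack([np.ones(B.shape[0]), B])  # intercept + TFs
    # lstsq solves for all A response columns in one call.
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[0], coef[1:]

# Hypothetical toy data: 400 genes, 3 TFs, 4 arrays.
rng = np.random.default_rng(1)
G, T, A = 400, 3, 4
B = rng.normal(size=(G, T))
F_true = rng.normal(size=(T, A))
d_true = rng.normal(size=A)
Y = d_true + B @ F_true + rng.normal(0.0, 0.05, size=(G, A))

d_hat, F_hat = fit_chip_regressions(Y, B)
```

Because the design matrix is shared across arrays, the per-array fits differ only in the response column, so they can be solved together.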
Their linear model for the log expression (ratio) $y_{ga}$ of gene g in microarray a is

$$y_{ga} = d_a + \sum_t f_{ta} \cdot B_{tg} + \varepsilon_{ga},$$

where $B_{tg}$ is the ChIP log-ratio in the promoter region of gene g obtained from the ChIP experiment for TF t. These authors used a = 1,…,751 (micro)array experiments and t = 1,…,113 transcription factors.

Comment, 1

The first three extensions were developed for two reasons:
a) the unrealism of the simple additive model; and
b) the need to consider information on multiple interacting TFBSs, i.e. to include interactions in the model.

From the statistical perspective, this is a potentially massive model selection problem, depending on how many sites or candidate TFBSs are involved. Up for discussion are the model classes (linear terms, Boolean expressions, products of linear splines, etc.), the model selection method (search algorithm, model comparison criterion), and the method for determining which terms in the model are to be taken seriously (significance? cross-validation?).

There is a lot of room for more work here, including data analysis. We should ask what the data look like.

Comment, 2

Another point is this. With the exception of the paper by Gao et al (2004), authors in this field have so far studied only single microarrays, or perhaps pairs (for relative expression), but not whole experiments.

In some cases, this might be biologically reasonable, as the role of a motif may be quite specific, e.g. to a time or a component of an experiment. But frequently it will be plausible that more complex functions of the microarray measurements will be indicative of regulation of a gene by a TF across an entire experiment. We’ll see examples. Nancy will continue this theme on Thursday.

Spellman’s yeast cell cycle experiment

M = Mitotic phase; Gi = Gaps, i = 1, 2; S = (DNA) Synthesis

Spellman’s microarray data

They span 2 cell cycles, with 18 time points in the interval [0, 119] minutes at 7-minute intervals.
We’ll discuss the alpha-arrest experiment, where the mRNA labelled red (Cy5) is from cells synchronized by adding the alpha factor, while that labelled green (Cy3) is from unsynchronized cells. The log expression ratio $y_{gt}$ for gene g at time t is thus

$$y_{gt} = \log_2 \left[ \mathrm{Synchronized}(t) / \mathrm{Asynchronous}(t) \right].$$

Let’s look at some plots of these data, and then turn to motifs and plots of gene expression vs motif counts.

Caption to preceding Figure 1 from Spellman et al (1998)

Figure 1. Gene expression
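
The time grid and the log-ratio defined for the alpha-arrest experiment can be checked with a few lines of code. This is a minimal sketch: the intensities are invented numbers for a single hypothetical gene, and the names `times` and `log_ratio` are illustrative, not part of the Spellman et al (1998) data.

```python
import numpy as np

# 7-minute sampling over [0, 119] minutes gives 18 time points.
times = np.arange(0, 120, 7)

def log_ratio(synchronized, asynchronous):
    """y_gt = log2(Synchronized(t) / Asynchronous(t)), elementwise."""
    synchronized = np.asarray(synchronized, dtype=float)
    asynchronous = np.asarray(asynchronous, dtype=float)
    return np.log2(synchronized / asynchronous)

# Invented intensities for one gene at the first three time points.
sync = np.array([200.0, 400.0, 100.0])
asyn = np.array([200.0, 100.0, 400.0])
y = log_ratio(sync, asyn)  # equal, 4-fold up, 4-fold down
```

On the log2 scale a value of 0 means no change relative to the asynchronous reference, and fold changes up and down by the same factor are symmetric around 0.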
