Identifying expression differences in cDNA microarray experiments, cont.
Lecture 21, Statistics 246
April 8, 2004

An empirical Bayes story

Suppose that our M-values are independently and normally distributed, and that a proportion p of genes are differentially expressed, i.e. have M's with non-zero means. Further, suppose that the variances and means of these are chosen jointly from inverse chi-square and normal conjugate priors, respectively. Genes not differentially expressed have zero means and variances chosen from the same inverse chi-square distribution. The scale and d.f. parameters in the inverse chi-square are estimated from the data, as is a parameter c connecting the prior for the mean with that for the variances. We then calculate the posterior probability that a given gene is differentially expressed, and find that it is an increasing function of B, given on the next slide, where a and c are estimated parameters and p is in the constant.

Empirical Bayes log posterior odds ratio (LOR)

$$B = \text{const} + \log \frac{\dfrac{2a}{n} + s^2 + M^2}{\dfrac{2a}{n} + s^2 + \dfrac{M^2}{1 + nc}}$$

Notice that for large n this is approximately a function of t = M/s: the terms 2a/n and M^2/(1 + nc) become negligible, leaving log(1 + M^2/s^2).

B = LOR compared with t and M

[Figure not recovered]

Extensions include dealing with

- Replicates within and between slides
- Several effects: use a linear model
- ANOVA: are the effects equal?
- Time series: selecting genes for trends

Summary (for the second simplest problem)

Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.

- Averages can be driven by outliers.
- t-statistics can be driven by tiny variances.

B = LOR will, we hope,
- use information from all the genes
- combine the best of M and t
- avoid the problems of M and t

Ranking on B could be helpful.

Use of linear models with cDNA microarray data

In many situations we want to combine data from different experiments in a slightly more elaborate manner than simply averaging. One way of doing so is via (fixed effects) linear models, where we estimate certain quantities of interest, which we call effects, for each gene on our slide. Typically these estimates
may be regarded as approximately normally distributed with common SD, and mean zero in the absence of any relevant differential expression. In such cases the preceding two strategies, qq-plots and various combinations of estimated effect (cf. M) and standardized estimate (cf. t), both apply.

Advantages of linear models

- Analyse all arrays together, combining information in an optimal way
- Combined estimation of precision
- Extensible to arbitrarily complicated experiments
- Design matrix: specifies RNA targets used on arrays
- Contrast matrix: specifies which comparisons are of interest

Log-ratios or single channel intensities?

Traditional analyses, as here, treat log-ratios M = log(R/G) as the primary data, i.e. we take gene expression measurements as relative.

An alternative approach treats individual channel intensities R and G as the primary data, i.e. views gene expression measures as absolute (Wolfinger, Churchill, Kerr).

A single channel approach makes new analyses possible, but it
- makes stronger assumptions
- requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on the same spot
- requires absolute normalization methods

Linear models for differential expression

[Figure: example designs involving targets A, B, C and a reference (Ref)]

Allows all comparisons to be estimated simultaneously.

Matrix multiplication

[Figure: example designs involving A, B, C and Ref, with arrays numbered 1 to 3; the matrices are not recoverable]

Slightly larger example

Six targets, with expected values
- WT.P1: μ
- MT.P1: μ + b
- WT.P11: μ + a1
- MT.P11: μ + a1 + b + a1.b
- WT.P21: μ + a1 + a2
- MT.P21: μ + a1 + a2 + b + a1.b + a2.b

compared on seven arrays with log-ratios y1, ..., y7:

$$
E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ b \\ a_1.b \\ a_2.b \end{pmatrix}
$$

Linear model estimates

Obtain a linear model for each gene g. Estimate the model by robust regression, least squares or generalized least squares to get
- coefficients
- standard deviations
- standard errors

Parallel inference for genes

- 10,000-40,000 linear models
- Curse of dimensionality: need to adjust for multiple testing, e.g. control the family-wise error rate (FWE) or the false discovery rate (FDR)
- Boon of parallelism: can borrow information from one gene
to another.

Hierarchical model

Normal model: [equations not recovered]
Prior: [equations not recovered]

Normality and independence assumptions are wrong but convenient; the resulting methods are useful.

Posterior statistics

Posterior variance estimators: [equation not recovered]
Moderated t-statistics: [equation not recovered]
Distribution under the null: [not recovered]

Eliminates large t-statistics arising merely from very small s.

Posterior odds

Posterior probability of differential expression for any gene is [equation not recovered].

- Monotonic function of [statistic not recovered] for constant d
- Generalization of B mentioned earlier

Exercise

Prove all the distributional statements made above.

Summary

- Analyse data all at once
- Use standard deviations, not just fold changes
- Use ensemble information to shrink variances
- Assess differential expression for all comparisons together

Appendix: slightly more general theoretical development

Assume that for each gene g,

$$E(y_g) = X\alpha_g, \qquad \operatorname{var}(y_g) = W_g\,\sigma_g^2.$$

Also assume that certain contrasts $\beta_g = C^T\alpha_g$ are of interest. The estimators of these contrasts and their estimated covariance matrices are

$$\hat\beta_g = C^T\hat\alpha_g, \qquad \operatorname{var}(\hat\beta_g) = C^T V_g C\,\sigma_g^2.$$

If we let $v_{gj}$ be the jth diagonal element of $C^T V_g C$, then our assumptions are

$$\hat\beta_{gj} \mid \beta_{gj}, \sigma_g^2 \sim N(\beta_{gj},\, v_{gj}\sigma_g^2), \qquad s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g}\,\chi^2_{d_g},$$

where $d_g$ is the residual degrees of freedom in the linear model for gene g.

Hierarchical model

As before, we define a simple hierarchical model to combine information across the genes. Prior information on the variances is assumed to come via a prior estimate $s_0^2$ with $d_0$ d.f.:

$$\frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_0^2}\,\chi^2_{d_0}.$$

For any given j, we assume $\beta_{gj}$ is non-zero with known proportion $p_j$:

$$\operatorname{pr}(\beta_{gj} \neq 0) = p_j.$$

For those genes which are differentially expressed, prior information is assumed in the form

$$\beta_{gj} \mid \sigma_g^2,\ \beta_{gj} \neq 0 \sim N(0,\, v_{0j}\sigma_g^2).$$

The moderated t and posterior odds are as before. See http://bioinf.wehi.edu.au/limma for an R package and more details.

From single genes to sets of genes

All of the foregoing has concerned single genes, i.e. has been a one-gene-at-a-time approach, although the empirical Bayes idea exploits the fact that there are really lots of genes. The problem of discovering sets of genes affected by a treatment at the same time as carrying out some kind of
statistical inference remains a challenge.

What follows is a quick reminder that many people use clustering in this context, sometimes ignoring known information about the units being clustered. What is relevant here is that clustering can deliver sets of genes associated with treatment or other differences, though not to date within a formal statistical framework. We then present a brief discussion of a related idea aimed at finding sets of genes, which can be examined in much the same way as single genes.
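
Returning to the "Slightly larger example" and "Linear model estimates" slides, the following is a minimal sketch of a least squares fit for one gene. It assumes the design matrix read from that slide (columns a1, a2, b, a1.b, a2.b; rows y1 to y7). The helper names `fit_gene` and `contrast`, the simulated effects and the noise level are all hypothetical and only serve the illustration.

```python
import numpy as np

# Columns: a1, a2, b, a1.b, a2.b; rows: arrays y1..y7 (as read from the slide).
X = np.array([
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
], dtype=float)


def fit_gene(X, y):
    """Ordinary least squares for one gene.

    Returns the coefficients, the residual variance s^2, the unscaled
    covariance V = (X'X)^-1 and the residual degrees of freedom d.
    """
    coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    d = X.shape[0] - rank
    s2 = float(resid @ resid) / d
    V = np.linalg.inv(X.T @ X)
    return coef, s2, V, d


def contrast(c, coef, s2, V):
    """Estimate the contrast c'alpha and its standard error sqrt(c'Vc * s^2)."""
    c = np.asarray(c, dtype=float)
    return float(c @ coef), float(np.sqrt(c @ V @ c * s2))


# Hypothetical effects and noise level, for illustration only.
rng = np.random.default_rng(0)
truth = np.array([0.5, -0.3, 1.0, 0.8, 0.0])
y = X @ truth + rng.normal(scale=0.1, size=X.shape[0])

coef, s2, V, d = fit_gene(X, y)
# The mutant versus wild-type difference at P11 is b + a1.b.
est, se = contrast([0, 0, 1, 1, 0], coef, s2, V)
print(dict(zip(["a1", "a2", "b", "a1.b", "a2.b"], coef.round(3))), "d.f. =", d)
print("b + a1.b:", round(est, 3), "+/-", round(se, 3))
```

With seven arrays and five effects only two residual degrees of freedom remain per gene, which is why borrowing information across genes for the variance is attractive.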
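
The remark on the "Posterior statistics" slide that moderation "eliminates large t-statistics merely from very small s" can be illustrated with a small sketch. The slide's own equations did not survive, so the code assumes the usual limma-style posterior variance, (d0 s0^2 + d_g s_g^2) / (d0 + d_g), with the prior values d0 and s0^2 treated as known rather than estimated from the data. The function name `moderated_t` and all numbers are hypothetical.

```python
import numpy as np


def moderated_t(beta_hat, s2, v, d, d0, s0_sq):
    """Shrink gene-wise variances toward a prior value and form t-statistics.

    beta_hat : estimated effects, one per gene
    s2       : gene-wise residual variances s_g^2
    v        : unscaled variances v_g of the estimates
    d        : residual degrees of freedom d_g
    d0, s0_sq: prior degrees of freedom and prior variance (assumed known here)
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    v = np.asarray(v, dtype=float)
    # Assumed form of the posterior variance: a weighted average of prior and sample variance.
    s2_post = (d0 * s0_sq + d * s2) / (d0 + d)
    t_ordinary = beta_hat / np.sqrt(s2 * v)
    t_moderated = beta_hat / np.sqrt(s2_post * v)
    return s2_post, t_ordinary, t_moderated


# Two hypothetical genes: a small effect with a tiny variance,
# and a large effect with a moderate variance.
beta_hat = [0.2, 2.0]
s2 = [0.0001, 0.5]
v = [0.25, 0.25]

s2_post, t_ord, t_mod = moderated_t(beta_hat, s2, v, d=3, d0=4, s0_sq=0.3)
print("ordinary t: ", t_ord.round(2))
print("moderated t:", t_mod.round(2))
```

In this toy case the ordinary t ranks the first gene highest only because its variance estimate is tiny; after shrinkage the gene with the larger effect ranks first.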