Estimating expression differences in cDNA microarray experiments Statistics 246 Spring 2002 Week 8 Lecture 1 Some motherhood statements Important aspects of a statistical analysis include Tentatively separating systematic from random sources of variation Removing the former and quantifying the latter when the system is in control Identifying and dealing with the most relevant source of variation in subsequent analyses Only if this is done can we hope to make more or less valid probability statements The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide This is a common enough hope Efforts are frequently successful It is not hard to do by eye The problem is probably beyond formal statistical inference valid p values etc for the foreseeable future why In the next two slides genes found to be up or downregulated in an 8 treatment Srb1 over expression versus 8 control comparison are indicated in red and green respectivley What do we see Matt Callow s Srb1 dataset 5 Newton s and Chen s single slide method Matt Callow s Srb1 dataset 8 Newton s Sapir Churchill s and Chen s single slide method The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides There are a number of different aspects First between slide normalization then What should we look at averages SDs tstatistics other summaries How should we look at them Can we make valid probability statements A report on work in progress begin with an example Apo AI experiment Matt Callow LBNL Goal To identify genes with altered expression in the livers of Apo AI knock out mice T compared to inbred C57Bl 6 control mice C 8 treatment mice and 8 control mice 16 hybridizations liver mRNA from each of the 16 mice Ti Ci is labelled with Cy5 while pooled liver mRNA from the control mice C is labelled with Cy3 Probes 6 000 cDNAs genes including 200 related to lipid metabolism Which genes have changed When permutation testing possible 1 For each gene and each hybridisation 8 ko 8 ctl use M log2 R G 2 For each gene form the t statistic average of 8 ko Ms average of 8 ctl Ms sqrt 1 8 SD of 8 ko Ms 2 SD of 8 ctl Ms 2 3 Form a histogram of 6 000 t values 4 Do a normal q q plot look for values off the line 5 Permutation testing next lecture 6 Adjust for multiple testing next lecture Histogram normal q q plot of t statistics ApoA1 What is a normal q q plot We have a random sample say ti i 1 n which we believe might come from a normal distribution If it did then for suitable and ti i 1 n would be uniformly distributed on 0 1 why where is the standard normal c d f Denoting the order statistics of the t sample by t 1 t 2 t n we can then see that t i should be approximately i n why With this in mind we d expect t i to be about 1 i n why Thus if we plot t i against 1 i 1 2 n 1 we might expect to see a straight line of slope about with intercept about The 1 2 and 1 in numerator and denominator of the i n are to avoid problems at the extremes This is our normal quantile quantile plot the i n being a quantile of the uniform and the 1 being that of the normal Why a normal q q plot One of the things we want to do with our t statistics is roughly speaking to identify the extreme ones It is natural to rank them but how extreme is extreme Since the sample sizes here are not too small two samples of 8 each gives 16 terms in the difference of the means approximate normality is not an unreasonable expectation for the null marginal distribution Converting ranked t s into a normal q q plot is a great way to see the extremes they are the ones that are off the line at one end or another This technique is particularly helpful when we have thousands of values Of course we can t expect all differentially expressed genes to stand out as extremes many will be masked by more extreme random variation which is a big problem in this context See next lecture for a discussion of these issues gene t index statistic 2139 22 Apo AI 4117 13 EST weakly sim to STEROL DESATURASE 5330 12 CATECHOL O METHYLTRANSFERASE 1731 11 Apo CIII 538 11 EST highly sim to Apo AI 1489 9 1 EST 2526 8 3 Highly sim to Apo CIII precursor 4916 7 7 similar to yeast sterol desaturase 941 4 7 2000 3 1 5867 4 2 4608 4 8 948 4 7 5577 4 5 Gene annotation Useful plots of t statistics Which genes have changed Permutation testing not possible Our current approach is to use averages SDs t statistics and a new statistic we call B inspired by empirical Bayes We hope in due course to calibrate B and use that as our main tool We begin with the motivation using data from a study in which each slide was replicated four times Results from 4 replicates Points to note One set green has a high average M but also a high variance and a low t Another pale blue has an average M near zero but a very small variance leading to a large negative t A third dark blue has a modest average M and a low variance leading to a high positive t A fourth purple has a moderate average M and a moderate variance leading to a small t Another pair yellow red have moderate average Ms and middling variances and moderately large ts Which do we regard most favourably Let s look at M and t jointly M t t M Sets defined by cut offs from the Apo AI ko experiment M t t M Results from the Apo AI ko experiment M t t M Apo AI experiment t vs average A T B t M B t B Results from SR BI transgenic experiment M B t M B t B B t M Results from SR BI transgenic experiment An empirical Bayes story Using average M alone we ignore useful information in the SD across replicated Some large values are large because of outliers Using t alone we are liable to be misled by very small SDs With thousands of genes some SDs will be very small Formal testing can sort out these issues for us but if we simply want to rank what should we rank on One approach SAM is to inflate the SDs slightly Another approach can be based on the following empirical Bayes story There are a number of variants Suppose that our M values are independently and normally distributed and that a proportion p of genes are differentially expressed i e have M s with non zero means Further suppose that the variances and means of …
View Full Document