Estimating expression differences in cDNA microarray experiments
Statistics 246, Spring 2002
Week 8, Lecture 1

Some motherhood statements
Important aspects of a statistical analysis include:
• Tentatively separating systematic from random sources of variation
• Removing the former and quantifying the latter, when the system is in control
• Identifying and dealing with the most relevant source of variation in subsequent analyses
Only if this is done can we hope to make more or less valid probability statements.

The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide
• This is a common enough hope
• Efforts are frequently successful
• It is not hard to do by eye
• The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future. Why?
In the next two slides, genes found to be up- or down-regulated in an 8 treatment (Srb1 over-expression) versus 8 control comparison are indicated in red and green, respectively. What do we see?

[Figure: Matt Callow’s Srb1 dataset (#5). Newton’s and Chen’s single slide method]
[Figure: Matt Callow’s Srb1 dataset (#8). Newton’s, Sapir & Churchill’s and Chen’s single slide method]

The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides
There are a number of different aspects:
• First, between-slide normalization; then
• What should we look at: averages, SDs, t-statistics, other summaries?
• How should we look at them?
• Can we make valid probability statements?
A report on work in progress: begin with an example.

• 8 treatment mice and 8 control mice
• 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
• Probes: ~6,000 cDNAs (genes), including 200 related to lipid metabolism.
Goal.
To identify genes with altered expression in the livers of Apo AI knock-out mice (T) compared to inbred C57Bl/6 control mice (C).
Apo AI experiment (Matt Callow, LBNL)

Which genes have changed? When permutation testing is possible
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
2. For each gene form the t statistic:
   t = (average of 8 ko Ms − average of 8 ctl Ms) / sqrt( (1/8) [ (SD of 8 ko Ms)^2 + (SD of 8 ctl Ms)^2 ] )
3. Form a histogram of 6,000 t values.
4. Do a normal q-q plot; look for values “off the line”.
5. Permutation testing (next lecture).
6. Adjust for multiple testing (next lecture).

[Figure: Histogram & normal q-q plot of t-statistics (ApoA1)]

What is a normal q-q plot?
We have a random sample, say t_i, i = 1, …, n, which we believe might come from a normal distribution. If it did, then for suitable µ and σ, Φ((t_i − µ)/σ), i = 1, …, n, would be uniformly distributed on [0,1] (why?), where Φ is the standard normal c.d.f. Denoting the order statistics of the t-sample by t_(1), t_(2), …, t_(n), we can then see that Φ((t_(i) − µ)/σ) should be approximately i/n (why?). With this in mind, we’d expect t_(i) to be about σ Φ^{-1}(i/n) + µ (why?).
Thus if we plot t_(i) against Φ^{-1}((i + 1/2)/(n + 1)), we might expect to see a straight line of slope about σ with intercept about µ. (The 1/2 and 1 in numerator and denominator of the i/n are to avoid problems at the extremes.)
This is our normal quantile-quantile plot, the i/n being a quantile of the uniform, and the Φ^{-1} being that of the normal.

Why a normal q-q plot?
One of the things we want to do with our t-statistics is, roughly speaking, to identify the extreme ones. It is natural to rank them, but how extreme is extreme? Since the sample sizes here are not too small (two samples of 8 each gives 16 terms in the difference of the means), approximate normality is not an unreasonable expectation for the null marginal distribution.
Converting ranked t’s into a normal q-q plot is a great way to see the extremes: they are the ones that are “off the line”, at one end or another.
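
As a rough illustration of steps 1 to 4 and of the q-q construction, here is a minimal Python sketch. The data are simulated (not the Apo AI measurements), the function names `two_sample_t` and `normal_qq_points` are hypothetical, and the plotting position (i + 1/2)/(n + 1) is taken from the text as written.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def two_sample_t(ko, ctl):
    """t statistic as in step 2: difference of averages over
    sqrt((1/n) * (SD_ko^2 + SD_ctl^2)), assuming equal group sizes n."""
    n = len(ko)
    assert n == len(ctl), "this sketch assumes equal group sizes"
    se = math.sqrt((stdev(ko) ** 2 + stdev(ctl) ** 2) / n)
    return (mean(ko) - mean(ctl)) / se

def normal_qq_points(t_values):
    """Pair each ordered t_(i) with the normal quantile at (i + 1/2)/(n + 1)."""
    n = len(t_values)
    ordered = sorted(t_values)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / (n + 1)), t)
            for i, t in enumerate(ordered, start=1)]

# Simulated M = log2(R/G) values (made-up data): 200 genes, 8 ko + 8 ctl slides.
rng = random.Random(0)
n_genes = 200
t_stats = []
for g in range(n_genes):
    shift = -2.0 if g == 0 else 0.0  # gene 0 is artificially down-regulated
    ko = [rng.gauss(shift, 0.3) for _ in range(8)]
    ctl = [rng.gauss(0.0, 0.3) for _ in range(8)]
    t_stats.append(two_sample_t(ko, ctl))

# Each pair is (theoretical normal quantile, observed ordered t).
qq = normal_qq_points(t_stats)
print(qq[0], qq[-1])
```

Plotting the second coordinate of each pair against the first gives the q-q plot; in this simulation the artificially shifted gene should appear as a point far below the line at the left-hand end.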
This technique is particularly helpful when we have thousands of values. Of course we can’t expect all differentially expressed genes to stand out as extremes: many will be masked by more extreme random variation, which is a big problem in this context. See next lecture for a discussion of these issues.

Gene annotation                          gene index   t statistic
Apo AI                                   2139         -22
EST, weakly sim. to STEROL DESATURASE    4117         -13
CATECHOL O-METHYLTRANSFERASE             5330         -12
Apo CIII                                 1731         -11
EST, highly sim. to Apo AI               538          -11
EST                                      1489         -9.1
Highly sim. to Apo CIII precursor        2526         -8.3
similar to yeast sterol desaturase       4916         -7.7
                                         941          -4.7
                                         2000         +3.1
                                         5867         -4.2
                                         4608         +4.8
                                         948          -4.7
                                         5577         -4.5

[Figure: Useful plots of t-statistics]

Which genes have changed? Permutation testing not possible
Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes.
We hope in due course to calibrate B and use that as our main tool.
We begin with the motivation, using data from a study in which each slide was replicated four times.

[Figure: Results from 4 replicates]

Points to note
One set (green) has a high average M but also a high variance and a low t.
Another (pale blue) has an average M near zero but a very small variance, leading to a large negative t.
A third (dark blue) has a modest average M and a low variance, leading to a high positive t.
A fourth (purple) has a moderate average M and a moderate variance, leading to a small t.
Another pair (yellow, red) have moderate average Ms and middling variances, and moderately large ts.
Which do we regard most favourably? Let’s look at M and t jointly.

[Figure: Sets defined by cut-offs: from the Apo AI ko experiment. Legend: M, t, t ∩ M]
[Figure: Results from the Apo AI ko experiment. Legend: M, t, t ∩ M]
[Figure: Apo AI experiment: t vs average A. Legend: M, t, t ∩ M]
[Figure: Results from SR-BI transgenic experiment. Legend: T, B, t ∩ M ∩ B, t ∩ B]
[Figure: Results from SR-BI transgenic experiment. Legend: M, B, t, M ∩ B, t ∩ B, t ∩ M ∩ B]

An empirical Bayes story
Using average M alone, we ignore useful information in the SD across replicates. Some large values are large because of outliers.
Using t alone, we are liable to be misled by very small SDs. With thousands of genes, some SDs will be very small.
Formal testing can sort out these issues for us, but if we simply want to rank, what should we rank on? One approach (SAM) is to inflate the SDs slightly. Another approach can be based on the following empirical Bayes story. There are a number of variants. Suppose that
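
The ranking problem described above, and the idea of inflating the SDs, can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the M values are invented, the one-sample form of t (average M over SD/sqrt(n)) is assumed for the four-replicate setting, and the constant `s0` added to each SD is an arbitrary choice rather than the value SAM would compute.

```python
import math
from statistics import mean, stdev

def one_sample_t(ms, s0=0.0):
    """Average M over (SD + s0)/sqrt(n); s0 = 0 gives the ordinary t."""
    n = len(ms)
    return mean(ms) / ((stdev(ms) + s0) / math.sqrt(n))

# Made-up M values for three hypothetical genes, four replicate slides each.
genes = {
    "tiny_sd": [0.10, 0.11, 0.10, 0.09],  # average near zero, very small SD
    "outlier": [0.1, 0.0, 0.2, 3.5],      # large average driven by one slide
    "steady":  [1.0, 1.2, 0.9, 1.1],      # sizeable average, modest SD
}

def ranking(s0):
    """Gene names ordered by decreasing |t| for a given inflation constant s0."""
    return sorted(genes, key=lambda g: abs(one_sample_t(genes[g], s0)),
                  reverse=True)

print(ranking(0.0))  # ordinary t: the tiny-SD gene comes first
print(ranking(0.2))  # inflated SDs: the steady gene comes first
```

With no inflation the gene whose average M is near zero ranks highest purely because its SD is tiny; adding a small constant to every SD moves the gene with a sizeable, consistent M to the top.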


Berkeley STATISTICS 246 - Lecture Notes
