Summarizing many probe intensities
Statistics 246, Spring 2006, Week 9, Lecture 2

Summarizing 11-20 probe intensity pairs to give a measure of expression for a probe set

There are many low-level summaries in use. Here are a few.
• MAS 5.0 (the Affymetrix method which replaced AvDiff, described previously);
• dChip, described previously;
• RMA = Robust Multi-chip Analysis, to be described now;
• GCRMA (the successor to RMA); and
• PLIER (the successor to MAS 5.0).

The MAS 5.0 expression summary

  log(Signal Intensity) = TukeyBiweight{log(PMj - MMj*)},

where MMj* is a version of MMj that is never bigger than PMj (the details are not important), and TukeyBiweight is a robust/resistant average of the 11-20 quantities.

Exercise. Tukey's biweight location estimator is an M-estimator. What is its form?

There are at least two problems with this summary. First, it is still single-chip, so it cannot benefit from information across chips. Second, the robustness is not necessarily in the correct place: a probe may function perfectly well, yet have much higher intensities than the other probes. This procedure would downweight it, perhaps ignore it, for no good reason.

[Figure: Some influence functions (courtesy of Philip Stark)]

[Figure: Some weight functions (courtesy of Philip Stark)]

What RMA does: four steps

It uses only the PMs and ignores the MMs. Also, it
• adjusts for background: call this -*BG;
• carries out quantile normalization of PM-*BG with chips in suitable sets: call the result n(PM-*BG);
• takes log2 of the normalized, background-adjusted PM; and
• carries out a Robust Multi-chip linear model Analysis (RMA) of the quantities log2 n(PM-*BG).

Why ignore the MM values?

The reason was that it was hard to do better using them. They definitely have information - about both signal and noise - but using them without adding more noise (see below) seemed to be a challenge. GCRMA and other improved background correction methods make use of MMs, though this hardly justifies having one MM for every PM.
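Returning to the exercise above: Tukey's biweight is the M-estimator whose ψ-function is ψ(u) = u(1 − u²)² for |u| ≤ 1 and 0 otherwise, so sufficiently extreme observations get weight zero. A minimal, hypothetical sketch of the corresponding location estimate follows; the function name, the tuning constant c, and the MAD-based scale are illustrative conventions, not the exact MAS 5.0 choices.

```python
import numpy as np

def tukey_biweight_location(x, c=5.0, eps=1e-4, max_iter=50):
    """Tukey biweight (bisquare) location M-estimate via IRWLS.

    Hypothetical minimal sketch: the tuning constant c and the
    MAD-based scale are common conventions, not necessarily the
    exact MAS 5.0 ones.
    """
    x = np.asarray(x, dtype=float)
    t = np.median(x)                                 # initial location estimate
    for _ in range(max_iter):
        s = np.median(np.abs(x - t)) + eps           # MAD scale (+ eps to avoid /0)
        u = (x - t) / (c * s)                        # standardized deviations
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)  # biweight weights
        t_new = np.sum(w * x) / np.sum(w)            # weighted mean
        if abs(t_new - t) < 1e-8:
            break
        t = t_new
    return t
```

Because the weight drops to zero for large |u|, a single wild value (e.g. one bad probe pair) has no influence on the final estimate, unlike the ordinary mean.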
A pool of control probes for use in estimating non-specific hybridization would do just as well, and this is now being used on some Affymetrix chips.

Why take log2?

Look at the SDs from replicate chips (figure).

Why Multi-chip Analysis?

To put each chip's values in the context of a set of similar values. We see that there are substantial probe effects in the responses (next slides), and that these are repeatable.

Such an obvious feature - the parallelism of probe response across chips in these figures - should be exploited in any analysis. This leads to an additive model on the log scale (or a multiplicative model on the intensity scale, as in dChip).

[Figure: Probe intensity vs concentration, example 1. A glance at some raw data: a 20-probe spike-in set across 14 arrays; PM intensity against concentration in pM (= array).]

[Figure: Probe intensity vs concentration, example 2. A 16-probe spike-in set: 6 non-responding.]

Why write log2 n(PM-*BG) ≈ chip effect + probe effect? Because probe effects are additive on the log scale. The spike-in data set in the previous two slides showed it. Indeed, every set of experiments should exhibit this parallel behaviour across probes. Can you see why?

Why a robust rather than a least squares fit?

In these large bodies of data we see many (perhaps up to 10%) outliers: image artifacts, bad probes, bad chips, and so on. Robust summaries really can improve over the standard ones, by down-weighting outliers and leaving their effects visible in residuals.

Not only are the estimates of the quantities that matter better, we can also use aspects of the robust analysis to carry out quality assessment.

How Robust Multi-chip Analysis works

The analysis involves robustly fitting the following linear model for one probe set:

  log2 n(PMij - *BG) = ai + bj + εij,

where i labels chips, j labels probes, and Σ bj = 0 is imposed for identifiability. Here the εij are the errors, assumed iid with mean zero and constant variance σ2.

Initially we used median polish, but the current implementation uses Huber's ψ.
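As a concrete illustration, a robust fit of this additive model by iteratively reweighted least squares with Huber's ψ might look as follows. This is a hypothetical minimal sketch, not the production RMA code: the function name, the tuning constant k = 1.345, and the MAD-based scale are my assumptions.

```python
import numpy as np

def robust_additive_fit(Y, k=1.345, n_iter=20):
    """Robustly fit log2 intensities Y (chips x probes) to the model
    Y[i, j] ~ a[i] + b[j], with sum(b) = 0, by IRWLS with Huber weights.

    Hypothetical sketch; the production RMA implementation differs in detail.
    """
    a = np.median(Y, axis=1)                       # initial chip effects
    b = np.median(Y - a[:, None], axis=0)          # initial probe effects
    b -= b.mean()                                  # identifiability: sum(b) = 0
    for _ in range(n_iter):
        r = Y - a[:, None] - b[None, :]            # residuals
        S = 1.4826 * np.median(np.abs(r))          # robust scale estimate (MAD)
        u = r / S                                  # standardized residuals
        # Huber weights: w = psi(u)/u = min(1, k/|u|)
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))
        a = np.sum(w * (Y - b[None, :]), axis=1) / np.sum(w, axis=1)  # weighted row means
        b = np.sum(w * (Y - a[:, None]), axis=0) / np.sum(w, axis=0)  # weighted col means
        b -= b.mean()                              # re-centre probe effects
    return a, b
```

Observations with large standardized residuals get weight k/|u| < 1, so a gross outlier in one chip-probe cell barely moves the chip and probe effect estimates, while remaining visible in the residuals.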
We'll describe both.

RMA in summary
• Background correct the PMs;
• carry out quantile normalization;
• take log2;
• under the additive model log2 n(PMij - *BG) = ai + bj + εij, estimate the chip effects ai and the probe effects bj (with Σ bj = 0) using a robust/resistant method.

M-estimators

One can estimate the parameters of the model as solutions to

  min over {ai, bj} of Σij ρ((Yij − ai − bj)/σ̂) = min over {ai, bj} of Σij ρ(uij),

where ρ is a symmetric, positive-definite function increasing less rapidly than x². Solutions to this minimization problem can be obtained by an Iteratively Reweighted Least Squares (IRWLS) procedure with weights

  wij = ρ′(uij)/uij = ψ(uij)/uij.

Robust fit by IRWLS

At each iteration:
  rij = Yij − current est(ai) − current est(bj)   (residuals)
  S = k·MAD(rij)   (a robust estimate of the scale parameter σ)
  uij = rij/S   (standardized residuals)
  wij = ψ(uij)/uij   (weights to reduce the effect of discrepant points on the next fit)

The next-step estimates are:
  next est(ai) = weighted row i mean
  next est(bj) = weighted column j mean − overall weighted mean

The way robustness works here

We look at the parameter estimates from a robust two-way analysis using median polish and the usual least squares, on data from 37 control probe sets across 6 chips. Overall we have 20×37 probe effects, 6×37 chip effects, and 20×6×37 residuals.

This was repeated for 37 randomly chosen probe sets, to check that the control probe sets were not atypical. They weren't.

After observing certain patterns, an explanation is offered of the way robustness works.

The two methods

Each fits the same additive model to probe-level data.

Median polish (mp) fits iteratively, successively removing row and column medians and accumulating the terms, until the process stabilizes.
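The median polish iteration just described can be sketched as follows; this is a minimal, hypothetical implementation, with variable names of my choosing.

```python
import numpy as np

def median_polish(Y, n_iter=10):
    """Tukey's median polish for an additive two-way fit.

    Iteratively sweeps out row and column medians, accumulating them
    into chip effects a and probe effects b. Minimal sketch.
    """
    r = np.array(Y, dtype=float)          # working copy; becomes the residuals
    a = np.zeros(r.shape[0])              # row (chip) effects
    b = np.zeros(r.shape[1])              # column (probe) effects
    for _ in range(n_iter):
        row_med = np.median(r, axis=1)    # sweep out row medians
        r -= row_med[:, None]
        a += row_med
        col_med = np.median(r, axis=0)    # sweep out column medians
        r -= col_med[None, :]
        b += col_med
    # centre the probe effects, moving their common level into the chip effects
    shift = np.median(b)
    b -= shift
    a += shift
    return a, b, r
```

Each sweep preserves the identity Y[i, j] = a[i] + b[j] + r[i, j], so on exactly additive data the residuals go to zero, while a single gross outlier stays visible in r rather than distorting a and b.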
The residuals are what is left at the end.

Least squares (lm) uses the familiar closed-form estimates of the parameters a and b (what are they?), and again the residuals are what is left after subtracting them from the observations.

[Figure: Two sets of residuals (control probes). Note the slightly different shapes of their distributions.]

[Figure: Similarly with random probe sets.]

[Figure: Normal QQ plots of mp and lm residuals. Which has (slightly) fatter tails?]

mp residuals have fatter tails than lm, but