Lecture 13: Diagnostics in MLR
• Added variable plots
• Identifying outliers
• Variance Inflation Factor
BMTRY 701 Biostatistical Methods II

Recall the added variable plots
• These can help check for adequacy of the model
• Is there curvature between Y and X after adjusting for the other X's?
• "Refined" residual plots
• They show the marginal importance of an individual predictor
• Help figure out a good form for the predictor

Example: SENIC
• Recall the difficulty determining the form for INFRISK in our regression model.
• Last time, we settled on including one term, INFRISK^2
• But, we could use an added variable plot approach.
• How? We want to know: adjusting for all else in the model, what is the right form for INFRISK?

R code
av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
resy <- av1$residuals
resx <- av2$residuals
plot(resx, resy, pch=16)
abline(lm(resy ~ resx), lwd=2)

Added Variable Plot
[Figure: added variable plot of resy (logLOS residuals) versus resx (INFRISK residuals), with fitted line]

What does that show?
• The relationship between logLOS and INFRISK if you added INFRISK to the regression
• But, is that what we want to see?
• How about looking at residuals versus INFRISK (before including INFRISK in the model)?

R code
mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
smoother <- lowess(INFRISK, mlr8$residuals)
plot(INFRISK, mlr8$residuals)
lines(smoother)

[Figure: mlr8 residuals versus INFRISK, with lowess smoother]

R code
> infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
> mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
+   CENSUS + factor(REGION))
> summary(mlr9)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.798e+00  1.667e-01  10.790  < 2e-16 ***
INFRISK          1.836e-03  1.984e-02   0.093 0.926478
infrisk.star     6.795e-02  2.810e-02   2.418 0.017360 *
AGE              5.554e-03  2.535e-03   2.191 0.030708 *
XRAY             1.361e-03  6.562e-04   2.073 0.040604 *
CENSUS           3.718e-04  7.913e-05   4.698 8.07e-06 ***
factor(REGION)2 -7.182e-02  3.051e-02  -2.354 0.020452 *
factor(REGION)3 -1.030e-01  3.036e-02  -3.391 0.000984 ***
factor(REGION)4 -2.068e-01  3.784e-02  -5.465 3.19e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1137 on 104 degrees of freedom
Multiple R-Squared: 0.6209, Adjusted R-squared: 0.5917
F-statistic: 21.29 on 8 and 104 DF, p-value: < 2.2e-16

Residual Plots
[Figure: residuals versus INFRISK for two models: left panel, mlr9 residuals (spline for INFRISK); right panel, mlr7 residuals (INFRISK^2)]

Which is better?
• Cannot compare via ANOVA because they are not nested!
• But, we can compare statistics qualitatively
• R-squared: MLR7: 0.60; MLR9: 0.62
• Partial R-squared: MLR7: 0.17; MLR9: 0.19

Identifying Outliers
• Harder to do in the MLR setting than in the SLR setting.
• Recall two concepts that make outliers important:
  • Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.
  • Influence is a measure of how much a data point actually does affect the estimated model.
• Leverage and influence both may be defined in terms of matrices

"Hat" matrix
• We must do some matrix stuff to understand this
• Section 6.2 is MLR in matrix terms
• Notation for a MLR with p predictors and data on n patients.
• The data:

    Y = [Y_1, Y_2, ..., Y_n]'                      (n x 1)

    X = [ 1  X_11  X_21  ...  X_p1
          1  X_12  X_22  ...  X_p2
          ...
          1  X_1n  X_2n  ...  X_pn ]               (n x (p+1))

Matrix Format for the MLR model
• THE MODEL:  Y = X*beta + e
  where beta = [beta_0, beta_1, ..., beta_p]'  and  e = [e_1, e_2, ..., e_n]'
• What are the dimensions of each?

"Transpose" and "Inverse"
• X-transpose: X' or X^T
• X-inverse: X^{-1}
• Hat matrix:  H = X(X'X)^{-1}X'
• Why is H important? It transforms Y's to Yhat's:  Yhat = HY

Estimating, based on fitted model
• Variance-covariance matrix of residuals:  s^2(e) = MSE (I - H)
• Variance of ith residual:  s^2(e_i) = MSE (1 - h_ii)
• Covariance of ith and jth residuals:  s(e_i, e_j) = -h_ij MSE  (i ≠ j)

Other uses of H
• e = (I - H)Y,  where I = identity matrix
• Variance-covariance matrix of residuals:  sigma^2(e) = sigma^2 (I - H)
• Variance of ith residual:  sigma^2(e_i) = sigma^2 (1 - h_ii)
• Covariance of ith and jth residuals:  sigma(e_i, e_j) = -h_ij sigma^2  (i ≠ j)

Property of hij's
• sum_{j=1}^{n} h_ij = 1  and  sum_{i=1}^{n} h_ij = 1
• This means that each row of H sums to 1
• And, that each column of H sums to 1

Other use of H
• Identifies points of leverage
[Figure: scatterplot of y versus x with four labeled points (1, 2, 3, 4)]

Using the Hat Matrix to identify outliers
• Look at h_ii to see if a data point is an outlier
• Large values of h_ii imply small values of var(e_i)
• As h_ii gets close to 1, var(e_i) approaches 0.
• Note that
    yhat_i = sum_{j=1}^{n} h_ij y_j = h_ii y_i + sum_{j ≠ i} h_ij y_j
• As h_ii approaches 1, yhat_i approaches y_i
• This gives h_ii the name "leverage"
• HIGH HAT VALUE IMPLIES POTENTIAL FOR OUTLIER!

R code
hat <- hatvalues(reg)
plot(1:102, hat)
highhat <- ifelse(hat > 0.10, 1, 0)
plot(x, y)
points(x[highhat==1], y[highhat==1], col=2, pch=16, cex=1.5)

Hat values versus index
[Figure: hat values plotted against index 1:102]

Identifying points with high hii
[Figure: scatterplot of y versus x with the high-hat points highlighted]

Does a high hat mean it has a large residual?
• No.
• h_ii measures leverage, not influence
• Recall what h_ii is made of:
  • it depends ONLY on the X's
  • it does not depend on the actual Y value
• Look back at the plot: which of these is probably most "influential"?
• Standard cutoffs
  for "large" h_ii:
  • 2p/n
  • 0.5 very high; 0.2-0.5 high

Let's look at our MLR9
• Any outliers?
[Figure: hat values for mlr9 plotted against index 1:length(hat9)]

Using the hat matrix in MLR
Studentized
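The hat-matrix facts above (H = X(X'X)^{-1}X', rows summing to 1, Yhat = HY, and the 2p/n leverage cutoff) can be checked numerically. This is a small sketch in Python/numpy rather than the lecture's R; the data are made up and the variable names are illustrative only. Here p counts all regression coefficients, including the intercept.

```python
import numpy as np

# Made-up design matrix: intercept plus 2 predictors for n = 20 cases
rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Hat matrix: H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                           # leverages h_ii

# Properties from the slides
assert np.allclose(H.sum(axis=1), 1.0)   # each row of H sums to 1
assert np.all((h > 0) & (h < 1))         # leverages lie between 0 and 1
assert np.isclose(h.sum(), p + 1)        # trace(H) = number of coefficients

# Yhat = H Y: H maps observed Y to fitted values
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
yhat = H @ y

# Rule-of-thumb cutoff for "large" leverage: 2 * (number of coefficients) / n
cutoff = 2 * (p + 1) / n
high_leverage = np.where(h > cutoff)[0]
print("points with leverage above 2p/n:", high_leverage)
```

Note that none of this uses y until the last step: leverage depends only on the X's, which is exactly why a high hat value signals potential, not actual, influence.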
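As a side note on the mlr9 fit shown earlier: the infrisk.star variable is a linear-spline (hinge) term with a knot at INFRISK = 4, built in R via ifelse(INFRISK > 4, INFRISK - 4, 0). A minimal Python/numpy sketch of the same construction, with made-up INFRISK values:

```python
import numpy as np

# Mirrors the R construction: infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
def hinge(x, knot=4.0):
    """Linear-spline basis: 0 at or below the knot, (x - knot) above it."""
    return np.where(x > knot, x - knot, 0.0)

infrisk = np.array([2.0, 3.5, 4.0, 5.0, 7.5])  # made-up INFRISK values
star = hinge(infrisk)                          # values: 0, 0, 0, 1, 3.5

# In the model logLOS ~ INFRISK + infrisk.star + ..., the slope in INFRISK is
# b_INFRISK below the knot and (b_INFRISK + b_star) above it, so the fitted
# line can bend at INFRISK = 4 while remaining continuous there.
```

This is why the mlr9 output shows a near-zero coefficient for INFRISK but a significant one for infrisk.star: the relationship is roughly flat below the knot and positive above it.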