Lecture 13: Diagnostics in MLR
Variance Inflation Factors; Added variable plots; Identifying outliers
BMTRY 701: Biostatistical Methods II

Contents: Variance Inflation Factor (VIF); How to calculate VIF?; Sounds like a lot of work!; X^T X; VIFs; Getting the VIFs the old-fashioned way; More practical way; What to do?; More diagnostics: the added variable plots; Example: SENIC; R code; Added Variable Plot; What does that show?; Residual Plots; Which is better?; Identifying Outliers; The "hat" matrix; Matrix format for the MLR model; "Transpose" and "inverse"; Estimating, based on the fitted model; Other uses of H; Property of the hij's; Using the hat matrix to identify outliers; Hat values versus index; Identifying points with high hii; Does a high hat mean a large residual?; Let's look at our MLR9; Using the hat matrix in MLR; Defining studentized residuals; Deleted residuals, di; Studentized deleted residuals; Another nice result; Testing for outliers

Variance Inflation Factor (VIF)

A diagnostic for multicollinearity. It describes the amount of an X that is explained by the other X's in the model. If the VIF is high, it suggests that the covariate should not be added. Why?
•it is redundant
•it adds variance to the model
•it creates 'instability' in the estimation

How to calculate VIF?

Simple idea: regress xj on the remaining x's in the model,

  xj = b0 + b1 x1 + ... + b(j-1) x(j-1) + b(j+1) x(j+1) + ... + bJ xJ + e,

and let Rj^2 be the coefficient of determination (R^2) obtained from that regression. Then the VIF for the jth covariate is

  VIFj = 1 / (1 - Rj^2)

Sounds like a lot of work!

You don't actually have to estimate a regression for each xj. Some matrix notation:
•X = the matrix of covariates, including a column for the intercept
•X^T = the transpose of X; that is, X flipped on its diagonal
•X^-1 = the inverse of X; that is, what you multiply X by to get the identity matrix
•I = the identity matrix: a matrix with 1's on the diagonal and 0's on the off-diagonal

A useful matrix: X^T X.
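As a quick check on the definition, here is a minimal R sketch with simulated data (the variable names x1, x2, x3 are made up, not the SENIC variables) showing that the regress-xj-on-the-rest formula agrees with the matrix shortcut described below, i.e., the diagonal of the inverse correlation matrix:

```r
# Simulated illustration: the VIF from the inverse correlation matrix
# matches 1/(1 - Rj^2) from regressing xj on the other x's.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)

W <- scale(cbind(x1, x2, x3))                    # standardize each column
vif.matrix <- diag(solve(t(W) %*% W / (n - 1)))  # diagonals of inverse correlation matrix

r2 <- summary(lm(x1 ~ x2 + x3))$r.squared        # the "lot of work" route, for x1
vif.regression <- 1 / (1 - r2)

all.equal(unname(vif.matrix[1]), vif.regression)  # TRUE
```

Both routes give the same number for each covariate; the matrix route simply computes all of them at once.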
(See Chapter 3 for lots on it!) Another useful matrix: (X^T X)^-1.

X^T X

Recall what it means to standardize a variable:
•subtract off the mean
•divide by the standard deviation
Imagine that you standardize all of the variables (x's) in your model, and call the new covariate matrix W. Now, if you calculate W^T W (and divide by n-1), it is the correlation matrix. Lastly, take the inverse of W^T W, i.e., (W^T W)^-1.

VIFs

The diagonals of the (W^T W)^-1 matrix are the VIFs. This is a natural by-product of the regression: the (W^T W)^-1 matrix is estimated when the regression is estimated. Rules of thumb:
•a VIF larger than 10 implies a serious multicollinearity problem
•VIFs of 5 or greater suggest that coefficient estimates may be misleading due to multicollinearity

Getting the VIFs the old-fashioned way

# standardize variables
ages      <- (AGE - mean(AGE)) / sqrt(var(AGE))
censuss   <- (CENSUS - mean(CENSUS)) / sqrt(var(CENSUS))
xrays     <- (XRAY - mean(XRAY)) / sqrt(var(XRAY))
infrisks  <- (INFRISK - mean(INFRISK)) / sqrt(var(INFRISK))
sqrtcults <- (sqrtCULT - mean(sqrtCULT)) / sqrt(var(sqrtCULT))
nurses    <- (NURSE - mean(NURSE)) / sqrt(var(NURSE))

# create matrix of covariates
xmat <- data.frame(ages, censuss, xrays, infrisks, sqrtcults, nurses)
xmat <- as.matrix(xmat)
n <- nrow(xmat)

# estimate x-transpose x and divide by n-1
cormat <- t(xmat) %*% xmat / (n - 1)

# solve() finds the inverse of a matrix
vifmat <- solve(cormat)
round(diag(vifmat), 2)

More practical way

library(HH)
mlr <- lm(logLOS ~ AGE + CENSUS + XRAY + INFRISK + sqrtCULT + NURSE)

> round(diag(vifmat), 2)
     ages   censuss     xrays  infrisks sqrtcults    nurses
     1.10      5.88      1.39      2.01      1.92      5.94
> vif(mlr)
     AGE    CENSUS      XRAY   INFRISK  sqrtCULT     NURSE
1.096204  5.875625  1.390417  2.007692  1.916983  5.935711

What to do?

It is unlikely that only one variable will have a high VIF. You then need to determine which variables to include and which to remove. Judgement should be based on science + statistics!

More diagnostics: the added variable plots

•These can help check the adequacy of the model: is there curvature between Y and X after adjusting for the other X's?
•They are "refined" residual plots.
•They show the marginal importance of an individual predictor.
•They help figure out a good form for the predictor.

Example: SENIC

Recall the difficulty determining the form for INFRISK in our regression model. Last time, we settled on including one term, INFRISK^2. But we could take an added variable plot approach instead. How? We want to know: adjusting for all else in the model, what is the right form for INFRISK?

R code

av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
resy <- av1$residuals
resx <- av2$residuals
plot(resx, resy, pch=16)
abline(lm(resy ~ resx), lwd=2)

Added Variable Plot

[Figure: resy (logLOS residuals) plotted against resx (INFRISK residuals), with the fitted line from lm(resy ~ resx).]

What does that show?

The relationship between logLOS and INFRISK if you added INFRISK to the regression. But is that what we want to see? How about looking at the residuals versus INFRISK (before including INFRISK in the model)?

R code

mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
smoother <- lowess(INFRISK, mlr8$residuals)
plot(INFRISK, mlr8$residuals)
lines(smoother)

[Figure: mlr8 residuals plotted against INFRISK, with the lowess smoother.]

R code

> infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
> mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
+            CENSUS + factor(REGION))
> summary(mlr9)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.798e+00  1.667e-01  10.790  < 2e-16 ***
INFRISK          1.836e-03  1.984e-02   0.093 0.926478
infrisk.star     6.795e-02  2.810e-02   2.418 0.017360 *
AGE              5.554e-03  2.535e-03   2.191 0.030708 *
XRAY             1.361e-03  6.562e-04   2.073 0.040604 *
CENSUS           3.718e-04  7.913e-05   4.698 8.07e-06 ***
factor(REGION)2 -7.182e-02  3.051e-02  -2.354 0.020452 *
factor(REGION)3 -1.030e-01  3.036e-02  -3.391 0.000984 ***
factor(REGION)4 -2.068e-01  3.784e-02  -5.465 3.19e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1137 on 104 degrees of freedom
Multiple R-Squared: 0.6209, Adjusted R-squared: 0.5917
F-statistic: 21.29 on 8 and 104 DF, p-value: < 2.2e-16

Residual Plots

[Figure: mlr9 residuals plotted against INFRISK (two panels; axis labels truncated in the source).]
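The infrisk.star term above is a linear spline ("broken stick") with a knot at INFRISK = 4: it lets the slope change at 4 while keeping the fitted line continuous there. A minimal sketch with simulated data (the names and numbers here are made up, not the SENIC data):

```r
# Illustrative only: a piecewise-linear (linear spline) term with a knot at 4.
set.seed(2)
x <- runif(200, 0, 8)
x.star <- ifelse(x > 4, x - 4, 0)    # 0 below the knot, (x - 4) above it
y <- 0.02 * x + 0.07 * x.star + rnorm(200, sd = 0.1)

fit <- lm(y ~ x + x.star)

# Slope below the knot:  coef(fit)["x"]
# Slope above the knot:  coef(fit)["x"] + coef(fit)["x.star"]
# The fit is continuous at the knot by construction:
predict(fit, data.frame(x = 4, x.star = 0))
```

This is why, in the mlr9 output, the INFRISK coefficient is the slope below 4 and the infrisk.star coefficient is the change in slope above 4.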