Lecture 13: Diagnostics in MLR
Variance Inflation Factors; Added variable plots; Identifying outliers
BMTRY 701: Biostatistical Methods II

Contents: Variance Inflation Factor (VIF); How to calculate VIF?; Sounds like a lot of work!; X^T X; VIFs; Getting the VIFs the old-fashioned way; More practical way; What to do?; More diagnostics: the added variable plots; Example: SENIC; R code; Added Variable Plot; What does that show?; Residual Plots; Which is better?; Identifying Outliers; The "hat" matrix; Matrix format for the MLR model; "Transpose" and "inverse"; Estimating, based on the fitted model; Other uses of H; Property of the hij's; Using the hat matrix to identify outliers; Hat values versus index; Identifying points with high hii; Does a high hat mean a large residual?; Let's look at our MLR9; Using the hat matrix in MLR; Defining studentized residuals; Deleted residuals, di; Studentized deleted residuals; Another nice result; Testing for outliers

Variance Inflation Factor (VIF)

A diagnostic for multicollinearity. It describes the amount of an X that is explained by the other X's in the model. If the VIF is high, it suggests that the covariate should not be added. Why?
•it is redundant
•it adds variance to the model
•it creates 'instability' in the estimation

How to calculate VIF?

Simple idea: regress xj on the remaining x's in the model,

  xj = b0 + b1 x1 + ... + b(j-1) x(j-1) + b(j+1) x(j+1) + ... + bJ xJ + e,

and let Rj^2 be the coefficient of determination (R^2) obtained from that regression. Then the VIF for the jth covariate is

  VIFj = 1 / (1 - Rj^2)

Sounds like a lot of work!

You don't actually have to estimate a regression for each xj. Some matrix notation:
•X = the matrix of covariates, including a column for the intercept
•X^T = the transpose of X; that is, X flipped on its diagonal
•X^-1 = the inverse of X; that is, what you multiply X by to get the identity matrix
•I = the identity matrix: a matrix with 1's on the diagonal and 0's on the off-diagonal

A useful matrix: X^T X.
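As a quick check on the definition, here is a minimal R sketch with simulated data (the variable names x1, x2, x3 are made up, not the SENIC variables) showing that the regress-xj-on-the-rest formula agrees with the matrix shortcut described below, i.e., the diagonal of the inverse correlation matrix:

```r
# Simulated illustration: the VIF from the inverse correlation matrix
# matches 1/(1 - Rj^2) from regressing xj on the other x's.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)

W <- scale(cbind(x1, x2, x3))                    # standardize each column
vif.matrix <- diag(solve(t(W) %*% W / (n - 1)))  # diagonals of inverse correlation matrix

r2 <- summary(lm(x1 ~ x2 + x3))$r.squared        # the "lot of work" route, for x1
vif.regression <- 1 / (1 - r2)

all.equal(unname(vif.matrix[1]), vif.regression)  # TRUE
```

Both routes give the same number for each covariate; the matrix route simply computes all of them at once.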
(See Chapter 3 for lots on it!) Another useful matrix: (X^T X)^-1.

X^T X

Recall what it means to standardize a variable:
•subtract off the mean
•divide by the standard deviation
Imagine that you standardize all of the variables (x's) in your model, and call the new covariate matrix W. Now, if you calculate W^T W (and divide by n-1), it is the correlation matrix. Lastly, take the inverse of W^T W, i.e., (W^T W)^-1.

VIFs

The diagonals of the (W^T W)^-1 matrix are the VIFs. This is a natural by-product of the regression: the (W^T W)^-1 matrix is estimated when the regression is estimated. Rules of thumb:
•a VIF larger than 10 implies a serious multicollinearity problem
•VIFs of 5 or greater suggest that coefficient estimates may be misleading due to multicollinearity

Getting the VIFs the old-fashioned way

# standardize variables
ages      <- (AGE - mean(AGE)) / sqrt(var(AGE))
censuss   <- (CENSUS - mean(CENSUS)) / sqrt(var(CENSUS))
xrays     <- (XRAY - mean(XRAY)) / sqrt(var(XRAY))
infrisks  <- (INFRISK - mean(INFRISK)) / sqrt(var(INFRISK))
sqrtcults <- (sqrtCULT - mean(sqrtCULT)) / sqrt(var(sqrtCULT))
nurses    <- (NURSE - mean(NURSE)) / sqrt(var(NURSE))

# create matrix of covariates
xmat <- data.frame(ages, censuss, xrays, infrisks, sqrtcults, nurses)
xmat <- as.matrix(xmat)
n <- nrow(xmat)

# estimate x-transpose x and divide by n-1
cormat <- t(xmat) %*% xmat / (n - 1)

# solve() finds the inverse of a matrix
vifmat <- solve(cormat)
round(diag(vifmat), 2)

More practical way

library(HH)
mlr <- lm(logLOS ~ AGE + CENSUS + XRAY + INFRISK + sqrtCULT + NURSE)

> round(diag(vifmat), 2)
     ages   censuss     xrays  infrisks sqrtcults    nurses
     1.10      5.88      1.39      2.01      1.92      5.94
> vif(mlr)
     AGE    CENSUS      XRAY   INFRISK  sqrtCULT     NURSE
1.096204  5.875625  1.390417  2.007692  1.916983  5.935711

What to do?

It is unlikely that only one variable will have a high VIF. You then need to determine which variables to include and which to remove. Judgement should be based on science + statistics!

More diagnostics: the added variable plots

•These can help check the adequacy of the model: is there curvature between Y and X after adjusting for the other X's?
•They are "refined" residual plots.
•They show the marginal importance of an individual predictor.
•They help figure out a good form for the predictor.

Example: SENIC

Recall the difficulty determining the form for INFRISK in our regression model. Last time, we settled on including one term, INFRISK^2. But we could take an added variable plot approach instead. How? We want to know: adjusting for all else in the model, what is the right form for INFRISK?

R code

av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
resy <- av1$residuals
resx <- av2$residuals
plot(resx, resy, pch=16)
abline(lm(resy ~ resx), lwd=2)

Added Variable Plot

[Figure: resy (logLOS residuals) plotted against resx (INFRISK residuals), with the fitted line from lm(resy ~ resx).]

What does that show?

The relationship between logLOS and INFRISK if you added INFRISK to the regression. But is that what we want to see? How about looking at the residuals versus INFRISK (before including INFRISK in the model)?

R code

mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
smoother <- lowess(INFRISK, mlr8$residuals)
plot(INFRISK, mlr8$residuals)
lines(smoother)

[Figure: mlr8 residuals plotted against INFRISK, with the lowess smoother.]

R code

> infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
> mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
+            CENSUS + factor(REGION))
> summary(mlr9)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.798e+00  1.667e-01  10.790  < 2e-16 ***
INFRISK          1.836e-03  1.984e-02   0.093 0.926478
infrisk.star     6.795e-02  2.810e-02   2.418 0.017360 *
AGE              5.554e-03  2.535e-03   2.191 0.030708 *
XRAY             1.361e-03  6.562e-04   2.073 0.040604 *
CENSUS           3.718e-04  7.913e-05   4.698 8.07e-06 ***
factor(REGION)2 -7.182e-02  3.051e-02  -2.354 0.020452 *
factor(REGION)3 -1.030e-01  3.036e-02  -3.391 0.000984 ***
factor(REGION)4 -2.068e-01  3.784e-02  -5.465 3.19e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1137 on 104 degrees of freedom
Multiple R-Squared: 0.6209, Adjusted R-squared: 0.5917
F-statistic: 21.29 on 8 and 104 DF, p-value: < 2.2e-16

Residual Plots

[Figure: mlr9 residuals plotted against INFRISK (two panels; axis labels truncated in the source).]
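The infrisk.star term above is a linear spline ("broken stick") with a knot at INFRISK = 4: it lets the slope change at 4 while keeping the fitted line continuous there. A minimal sketch with simulated data (the names and numbers here are made up, not the SENIC data):

```r
# Illustrative only: a piecewise-linear (linear spline) term with a knot at 4.
set.seed(2)
x <- runif(200, 0, 8)
x.star <- ifelse(x > 4, x - 4, 0)    # 0 below the knot, (x - 4) above it
y <- 0.02 * x + 0.07 * x.star + rnorm(200, sd = 0.1)

fit <- lm(y ~ x + x.star)

# Slope below the knot:  coef(fit)["x"]
# Slope above the knot:  coef(fit)["x"] + coef(fit)["x.star"]
# The fit is continuous at the knot by construction:
predict(fit, data.frame(x = 4, x.star = 0))
```

This is why, in the mlr9 output, the INFRISK coefficient is the slope below 4 and the infrisk.star coefficient is the change in slope above 4.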