10.4 Identifying influential cases – DFFITS, Cook's distance, and DFBETAS measures

Once outlying observations are identified, it needs to be determined whether they are influential on the sample regression model.

Influence on a single fitted value – DFFITS

The influence of observation i on $\hat{Y}_i$ is measured by

  $(\text{DFFITS})_i = \dfrac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)} h_{ii}}}$

"DF" stands for the DIFFerence in FITted values. $(\text{DFFITS})_i$ is the number of standard deviations by which $\hat{Y}_i$ changes when observation i is removed from the data set.

DFFITS can be expressed as

  $(\text{DFFITS})_i = t_i \left(\dfrac{h_{ii}}{1-h_{ii}}\right)^{1/2} = e_i \left[\dfrac{n-p-1}{SSE(1-h_{ii}) - e_i^2}\right]^{1/2} \left(\dfrac{h_{ii}}{1-h_{ii}}\right)^{1/2}$

Therefore, only one regression model needs to be fit.

Guideline for determining influential observations:
 $|(\text{DFFITS})_i| > 1$ for "small to medium" sized data sets
 $|(\text{DFFITS})_i| > 2\sqrt{p/n}$ for large data sets

2012 Christopher R. Bilder 10.28

Influence on all fitted values – Cook's Distance

Cook is a graduate of Kansas State University and is a professor at the University of Minnesota.

Cook's distance measures the influence of the ith observation on ALL n predicted values:

  $D_i = \dfrac{\sum_{j=1}^{n}\left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{p \cdot MSE}$

Notes:
1. The numerator is similar to $(\text{DFFITS})_i$; for Cook's distance, ALL of the fitted values are compared.
2. The denominator serves as a standardizing measure.

Cook's distance can be expressed as

  $D_i = \dfrac{e_i^2\, h_{ii}}{p \cdot MSE \,(1-h_{ii})^2}$

Therefore, only one regression model needs to be fit. From examining the above formula, note how $D_i$ can be large (examine $e_i$ and $h_{ii}$).

Guideline for determining influential observations: $D_i > F(0.50;\, p,\, n-p)$

Influence on the regression coefficients – DFBETAS

Measures the influence of the ith observation on each estimated regression coefficient $b_k$.
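Before turning to DFBETAS, the "only one fit is needed" claims above can be checked numerically: the one-fit formulas for DFFITS and Cook's distance should match brute-force deletion refits exactly. Below is a minimal Python/NumPy sketch of that check (the notes themselves use R; the data values here are made up for illustration).

```python
import numpy as np

# Toy data for illustration; p = 2 (intercept + slope), last point is unusual
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
Y = np.array([1.2, 1.9, 3.2, 3.9, 5.1, 2.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
h = np.diag(X @ XtX_inv @ X.T)      # leverages h_ii from the hat matrix
b = XtX_inv @ X.T @ Y               # least squares estimates
e = Y - X @ b                       # residuals
SSE = e @ e
MSE = SSE / (n - p)

dffits_onefit = np.empty(n); dffits_refit = np.empty(n)
cooks_onefit = np.empty(n);  cooks_refit = np.empty(n)
for i in range(n):
    # Closed forms: only the full fit is needed
    t_i = e[i] * np.sqrt((n - p - 1) / (SSE * (1 - h[i]) - e[i]**2))
    dffits_onefit[i] = t_i * np.sqrt(h[i] / (1 - h[i]))
    cooks_onefit[i] = e[i]**2 * h[i] / (p * MSE * (1 - h[i])**2)

    # Definitions: actually delete observation i and refit
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    e_i = Y[keep] - X[keep] @ b_i
    MSE_i = (e_i @ e_i) / (n - 1 - p)           # MSE_(i)
    Yhat_full, Yhat_del = X @ b, X @ b_i
    dffits_refit[i] = (Yhat_full[i] - Yhat_del[i]) / np.sqrt(MSE_i * h[i])
    cooks_refit[i] = np.sum((Yhat_full - Yhat_del)**2) / (p * MSE)

print(np.allclose(dffits_onefit, dffits_refit))  # -> True
print(np.allclose(cooks_onefit, cooks_refit))    # -> True
```

Both identities hold exactly (up to floating point), and the high-leverage outlier at x = 9 dominates both measures, as the formulas suggest it should.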
Let $b_{k(i)}$ be the estimate of $\beta_k$ with the ith observation removed from the data set, and $c_{kk}$ be the kth diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ (remember that $\mathbf{X}$ is an $n \times p$ matrix). Then

  $(\text{DFBETAS})_{k(i)} = \dfrac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)} c_{kk}}}$ for $k = 0, 1, \ldots, p-1$

Notes:
1. Notice that a DFBETAS is calculated for each k and each observation.
2. Remember that $\text{Var}(\mathbf{b}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ from Chapters 5 and 6. Thus, the variance of $b_k$ is $\sigma^2 c_{kk}$. In this case, $\sigma^2$ is estimated by $MSE_{(i)}$. Therefore, the denominator serves as a standardizing measure.

Guideline for determining influential observations:
 $|(\text{DFBETAS})_{k(i)}| > 1$ for "small to medium" sized data sets
 $|(\text{DFBETAS})_{k(i)}| > 2/\sqrt{n}$ for large data sets

Influence on inferences

Examine the inferences from the sample regression model with and without the observation(s) of concern. If inferences are unchanged, remedial action is not necessary. If inferences are changed, remedial action is necessary.

Some final comments – p. 406 of KNN. READ! See the discussion of the "masking" effect.

Example: HS and College GPA data set with extra observation (HS_GPA_ch10.R)

Suppose we look at the data set with the extra observation of (HS GPA, College GPA) = (X, Y) = (4.35, 1.5) added.

> dffits.i <- dffits(model = mod.fit)
> dffits.i[abs(dffits.i) > 1]
       21
-3.348858
> n <- length(mod.fit$residuals)
> p <- length(mod.fit$coefficients)
> dffits.i[abs(dffits.i) > 2*sqrt(p/n)]
       21
-3.348858
> cook.i <- cooks.distance(model = mod.fit)
> cook.i[cook.i > qf(p = 0.5, df1 = p, df2 = mod.fit$df.residual)]
      21
1.658858
> # Be careful - dfbeta() (without the "s") finds something a little different
> dfbeta.all <- dfbetas(model = mod.fit)
> dfbeta.all[abs(dfbeta.all[,2]) > 1, 2]  # Do not need to look at beta0, only beta1
[1] -2.911974
> dfbeta.all[abs(dfbeta.all[,2]) > 2/sqrt(n), 2]
        18         21
 0.4408765 -2.9119744
> round(dfbeta.all, 2)
   (Intercept) HS.GPA
1        -0.01   0.08
2         0.00   0.00
3         0.06   0.01
4        -0.10   0.07
5         0.00   0.00
6        -0.25   0.34
7        -0.09   0.19
8         0.09  -0.04
9         0.04   0.01
10        0.07  -0.06
11       -0.08   0.04
12        0.02  -0.03
13        0.12  -0.09
14       -0.05   0.08
15        0.05  -0.04
16       -0.31   0.26
17       -0.08   0.04
18       -0.31   0.44
19       -0.02   0.01
20       -0.22   0.16
21        2.17  -2.91

Again, the extra observation added is found by these measures to be potentially influential. One should now examine the model with and without the observation.

Previously, without the observation, b0 = 0.70758 and b1 = 0.69966. With the observation, b0 = 1.1203 and b1 = 0.5037. Thus, we see a change of (0.69966 − 0.5037)/0.69966 = 28% for b1. While the direction of the relationship has not changed, there is a considerable change in its strength. This corresponds to what the DFBETAS found.

With respect to the Ŷ value for X = 4.35, we obtain 3.751092 when using the data set without the extra observation and 3.311581 with it. Thus, we see a change of (3.751092 − 3.311581)/3.751092 = 11.72%. As a percentage, this is possibly not a "large" change, although based upon our understanding of GPAs, we may consider it an important change.

What should you do now? Below are possibilities.
1) Find a new predictor variable.
2) Remove the observation? You NEED justification beyond it being influential.
3) Use an estimation method other than least squares that is not as sensitive (Chapter 11).
4) Leave it in? Then include in the model's interpretation that this observation exists.

Of course, it is also important to make sure the observation's data values are correct.
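The with/without comparison above can also be scripted directly from the DFBETAS definition. Below is a minimal Python/NumPy sketch (the function name is hypothetical; the notes themselves use R's dfbetas()) that computes DFBETAS by refitting with each observation deleted, along with the percent-change arithmetic for b1 quoted above.

```python
import numpy as np

def dfbetas_by_deletion(X, Y):
    """DFBETAS from its definition: (b_k - b_k(i)) / sqrt(MSE_(i) * c_kk)."""
    n, p = X.shape
    c = np.diag(np.linalg.inv(X.T @ X))          # c_kk values
    b = np.linalg.lstsq(X, Y, rcond=None)[0]     # full-data estimates
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i                 # delete observation i
        b_i = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
        resid = Y[keep] - X[keep] @ b_i
        MSE_i = (resid @ resid) / (n - 1 - p)    # MSE with observation i removed
        out[i] = (b - b_i) / np.sqrt(MSE_i * c)
    return out

# Percent change in b1 from the GPA example (values quoted in the notes)
b1_without, b1_with = 0.69966, 0.5037
pct = (b1_without - b1_with) / b1_without * 100
print(round(pct))  # -> 28
```

In practice R's built-in dfbetas() uses the equivalent one-fit formulas, so no refitting loop is needed; the loop here just makes the definition concrete.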
Example: NBA guard data (nba_ch10.R)

Often, when there are a large number of observations, examining graphical summaries of these influence measures can be helpful.

> # DFFITS vs. observation number
> plot(x = 1:n, y = dffits.i, xlab = "Observation number", ylab = "DFFITS",
    main = "DFFITS vs. observation number",
    panel.first = grid(col = "gray", lty = "dotted"),
    ylim = c(min(-1, -2*sqrt(p/n), min(dffits.i)),
             max(1, 2*sqrt(p/n), max(dffits.i))))
> abline(h = 0, col = "darkgreen")
> abline(h = c(-2*sqrt(p/n), 2*sqrt(p/n)), col = "red", lwd = 2)
> abline(h = c(-1, 1), col = "darkred", lwd = 2)
> identify(x = 1:n, y = dffits.i)
[1]   7  21  37  52  53  72  73 104
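With many observations, it can also be convenient to flag threshold exceedances programmatically rather than reading them off the plot with identify(). A minimal Python/NumPy sketch applying the large-data-set DFFITS cutoff $2\sqrt{p/n}$ from the guideline earlier (the function name and DFFITS values below are hypothetical, for illustration only):

```python
import numpy as np

def flag_influential(dffits, p, n):
    """Return 1-based indices (to match R) of observations whose |DFFITS|
    exceeds the large-data-set cutoff 2*sqrt(p/n)."""
    cutoff = 2 * np.sqrt(p / n)
    return np.flatnonzero(np.abs(dffits) > cutoff) + 1

# Made-up DFFITS values; with p = 2 and n = 5 the cutoff is 2*sqrt(0.4) ~= 1.265
dffits = np.array([0.1, -0.9, 0.2, 0.05, 1.3])
print(flag_influential(dffits, p=2, n=5))  # -> [5]
```

The same idea works for Cook's distance or DFBETAS by swapping in the corresponding cutoff.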