
Additional Regression Diagnostics

Model adequacy: evaluation of independent variables

It is possible to graphically examine "Y on X" after adjusting for another variable. This is called a partial regression plot.

The idea is that we are going to examine the residuals of Y, after adjusting for other variable(s). However, in the sweepout EVERYTHING gets adjusted, including the X variable. Therefore, we plot $e_Y$ on $e_X$, both adjusted for the other variable(s) in the model.

e.g. Given the model
    $Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i$
now we want to add $X_2$, and we would like to examine its relation graphically to Y after adjusting for $X_1$. Then we get
    $e(Y|X_1)$ and $e(X_2|X_1)$
and plot $e(Y|X_1)$ on $e(X_2|X_1)$.

SEE HANDOUT

Partial regression plots are not used like residual plots (to examine a fit), even though I put in a VREF line for comparison, but are used more like raw data plots to examine
  a) the potential for a fit, or the need for fitting a variable
  b) curvature

More on identifying outliers: beyond the residual plot

Identifying unusual X values: leverage

We usually think of outliers as being values of Y which are out of line. However, it may be values of X which are off. In order to determine this we can calculate leverage.

    $H = X(X'X)^{-1}X'$, recall the HAT MATRIX

uses:
    $\hat{Y} = HY$
    $e = Y - \hat{Y} = (I - H)Y$
    $\sigma^2\{e\} = \sigma^2 (I - H)$ gives an individual variance for each $e_i$, since the variance changes along the regression line

In the hat matrix, the vector X is some observed set of X values, e.g.
    $X' = [1 \; X_1 \; X_2 \; X_3]$
If we take a particular OBSERVED set of X's, call it $X_i$, and calculate
    $h_{ii} = X_i'(X'X)^{-1}X_i$
this gives a value we will call $h_{ii}$.

Likewise, we take the whole n*p X matrix, and get an H matrix (the hat matrix)
    $H = X(X'X)^{-1}X'$
Then the diagonal elements are those "h" elements, called $h_{ii}$ values.

These $h_{ii}$ values have certain properties:
  1) the values are between 0 and 1
  2) the values sum to p (the number of regressors plus the intercept)

These values are called the leverage of the i-th case. They provide a relative measure of the distance between the i-th case and the mean of all cases. Therefore, a large leverage value indicates an "outlier" in that the value is far removed from the mean.

Furthermore, since $\hat{Y} = HY$, the values of the hat matrix represent weights applied to the Y vector in calculating the predicted values of Y.

How are the $h_{ii}$ values used?

The mean of the $h_{ii}$ values (since they sum to p) is
    $\bar{h} = p/n$ (note that this is < 1)
A leverage value is considered "large" if it is more than twice the value $\bar{h}$,
    i.e. $h_{ii} > 2p/n$ is considered a possible outlier.
Also, as a general rule, $h_{ii}$ values greater than 0.5 are "large", while those between 0.2 and 0.5 are moderately large.
Also look for a leverage value which is noticeably larger than the next largest.

SEE HANDOUT

Identifying unusual Y values: outliers

First look at the residuals; this is always a valuable technique.

Studentized residuals

Since we cannot look at an individual residual and determine if it is "too large", we could standardize the residuals to a mean of zero and a variance of 1.
Since the residuals already have a mean of zero, we need only calculate
    Studentized residual $= e_i / \sqrt{MSE}$
If normally distributed (which we have already assumed) then
    about 68% are between -1 and +1
    about 95% are between -2 and +2
    about 99% are between -2.5 and +2.5
These are available in SAS as STUDENT.

Internally adjusted studentized residuals: instead of adjusting all of the residuals to MSE, we could adjust each residual to its own variance,
    $e_i^* = e_i / s\{e_i\}$
where $s^2\{e_i\} = MSE(1 - h_{ii})$ - this is not the same as $s^2\{\hat{Y}_i\}$.
These values are not provided in SAS, but the individual components $e_i$ and $s\{e_i\}$ are provided, so the values could be calculated readily.
!!! SAS gives these (not the residuals divided by $\sqrt{MSE}$)

Deleted residuals

Sometimes a value is such a great outlier that it pulls the whole regression line towards itself. It would then be useful to know how far that point lies from a regression line which is fitted with that point excluded. This requires that we fit the regression line WITHOUT each point, and then calculate the deviation of each point.

This sounds like a lot of work, but is readily calculated with the $h_{ii}$ values previously mentioned:
    $d_i = Y_i - \hat{Y}_{i(i)} = e_i / (1 - h_{ii})$
where $\hat{Y}_{i(i)}$ is the predicted value for case i from the regression fitted without case i. This will identify residuals which pull the line to themselves, and may not be otherwise easily detected.

Further, we could calculate the variance (MSE) adjusted values of the deleted residuals:
    either values adjusted for the residual's individual variance
    or studentized values of all deleted residuals
These studentized deleted residual values are available in SAS as RSTUDENT.
Internally adjusted studentized residuals are not available in SAS, but the components are available for each observation.

Identifying influential cases

An outlier may not have much influence on the regression line, or its influence may be very great. Once an outlier is found, we ask: "How much influence does the outlier have on the regression line?" If the point were omitted, would the results change?
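Before turning to the influence statistics, the leverage and residual formulas above can be tried out directly from the hat matrix. The following is only a minimal sketch, assuming NumPy is available; the function name `residual_diagnostics` and the small data set are made up for illustration and are not from the handout. The case-deleted MSE uses the standard identity $(n-p)\,MSE = (n-p-1)\,MSE_{(i)} + e_i^2/(1-h_{ii})$, so no refitting is needed.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Leverage and residual diagnostics for a linear regression.

    X is the n x p design matrix (intercept column included), y the response.
    Hypothetical helper for illustration only.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X(X'X)^-1 X'
    h = np.diag(H)                                 # leverage values h_ii
    e = y - H @ y                                  # residuals e = (I - H)Y
    mse = e @ e / (n - p)

    studentized = e / np.sqrt(mse)                 # e_i / sqrt(MSE)
    internal = e / np.sqrt(mse * (1 - h))          # e_i / s{e_i}
    deleted = e / (1 - h)                          # d_i = e_i / (1 - h_ii)

    # MSE of the fit with case i removed, obtained without refitting
    mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
    rstudent = e / np.sqrt(mse_del * (1 - h))      # studentized deleted residual

    return {"h": h, "e": e, "studentized": studentized,
            "internal": internal, "deleted": deleted, "rstudent": rstudent}

# Made-up data: the last case has an X value far from the others.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
y = np.array([2.8, 2.8, 3.6, 3.6, 4.7, 5.0, 5.4, 6.3, 6.2, 12.0])
X = np.column_stack([np.ones_like(x), x])

out = residual_diagnostics(X, y)
n, p = X.shape
high_leverage = np.flatnonzero(out["h"] > 2 * p / n)   # rule of thumb h_ii > 2p/n
print("leverage:", np.round(out["h"], 3))
print("cases with h_ii > 2p/n:", high_leverage)
print("deleted residuals:", np.round(out["deleted"], 3))
```

Comparing `out["e"]` with `out["deleted"]` for the flagged case shows how a high-leverage point can have a modest ordinary residual but a much larger deleted residual, which is the situation the notes describe.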
There are three statistics which measure "influence", all calculated for each observation in the regression.

How much does the predicted value change?

DFFITS (difference in fits as judged by the predicted values)
    $DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)} h_{ii}}}$
Since the value is standardized by MSE, it represents roughly the number of standard deviation units that the predicted value changes when the particular point is omitted.

How much does a particular regression coefficient change?

DFBETAS (difference in fits as judged by the regression coefficients)
    $DFBETAS_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)} c_{kk}}}$
where $c_{kk}$ is the k-th diagonal element of $(X'X)^{-1}$. Note that this is also a standardized value; the interpretation is similar to DFFITS.
!!! this value is for standardized $b_k$ in SAS ??? check this

How much do the regression coefficients change overall?

Cook's D (D is for distance)

The boundary of a simultaneous regional confidence region for all regression coefficients can be calculated as
    $D = \frac{(b - \beta)'X'X(b - \beta)}{p \, MSE} = F(1 - \alpha; p, n - p)$
This can be modified to determine the effect of removing a point on all of the regression coefficients simultaneously by
    $D_i = \frac{(b - b_{(i)})'X'X(b - b_{(i)})}{p \, MSE}$
This does not follow an F distribution, but it is useful to compare it to the percentiles of the $F(p, n - p)$ distribution.
    if < 10 or 20
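The three influence statistics above can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy; the function name `influence_measures` and the data are hypothetical. Rather than refitting n times, it uses the standard leave-one-out identity $b - b_{(i)} = (X'X)^{-1} X_i \, e_i / (1 - h_{ii})$, which is an addition here and is not stated in the notes.

```python
import numpy as np

def influence_measures(X, y):
    """DFFITS, DFBETAS and Cook's D for each case (hypothetical helper).

    X is the n x p design matrix (intercept column included), y the response.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                              # full-data coefficients
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)        # leverage values h_ii
    e = y - X @ b
    mse = e @ e / (n - p)
    mse_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)   # MSE_(i)

    # Row i holds b - b_(i), the change in coefficients when case i is dropped
    delta_b = (XtX_inv @ X.T * (e / (1 - h))).T

    # DFFITS_i = (Yhat_i - Yhat_i(i)) / sqrt(MSE_(i) h_ii)
    fit_change = np.sum(X * delta_b, axis=1)
    dffits = fit_change / np.sqrt(mse_del * h)

    # DFBETAS_k(i) = (b_k - b_k(i)) / sqrt(MSE_(i) c_kk)
    c = np.diag(XtX_inv)
    dfbetas = delta_b / np.sqrt(mse_del[:, None] * c[None, :])

    # Cook's D_i = (b - b_(i))' X'X (b - b_(i)) / (p MSE)
    cooks_d = np.einsum("ij,jk,ik->i", delta_b, X.T @ X, delta_b) / (p * mse)

    return {"h": h, "dffits": dffits, "dfbetas": dfbetas, "cooks_d": cooks_d}

# Made-up data: the last case is far out in X and off the trend of the rest.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
y = np.array([2.8, 2.8, 3.6, 3.6, 4.7, 5.0, 5.4, 6.3, 6.2, 12.0])
X = np.column_stack([np.ones_like(x), x])

res = influence_measures(X, y)
print("DFFITS:  ", np.round(res["dffits"], 3))
print("Cook's D:", np.round(res["cooks_d"], 3))
print("DFBETAS (rows = cases, columns = coefficients):")
print(np.round(res["dfbetas"], 3))
```

To follow the notes' suggestion of comparing $D_i$ with F percentiles, one option (assuming SciPy is installed) would be `scipy.stats.f.cdf(res["cooks_d"], p, n - p)`.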

