Course: Multiple Regression
Topic: Regression Diagnostics

REGRESSION DIAGNOSTICS: KNOW THY DATA

Thus far we have discussed the basics of bivariate and multiple regression, and we can easily compute a linear model that relates a dependent variable to one or more predictor variables. We should not, however, accept the results of a regression analysis without thoroughly examining and understanding our data, to ensure that we have neither violated the assumptions of the regression analysis nor fallen prey to problems that bias its results.

The following example, developed by Anscombe (1973, American Statistician, 27, pp. 17-21) and described at Gerard Dallal's website (http://www.tufts.edu/~gdallal/anscombe.htm), demonstrates why a regression equation should not be accepted at face value. Anscombe generated data for six variables (Y1, Y2, Y3, Y4, X, and X4) to demonstrate that very different patterns of data can produce identical regression models. The following table contains the data for the six variables.

     X     Y1     Y2     Y3     X4     Y4
    10   8.04   9.14   7.46      8   6.58
     8   6.95   8.14   6.77      8   5.76
    13   7.58   8.74  12.74      8   7.71
     9   8.81   8.77   7.11      8   8.84
    11   8.33   9.26   7.81      8   8.47
    14   9.96   8.10   8.84      8   7.04
     6   7.24   6.13   6.08      8   5.25
     4   4.26   3.10   5.39     19  12.50
    12  10.84   9.13   8.15      8   5.56
     7   4.82   7.26   6.42      8   7.91
     5   5.68   4.74   5.73      8   6.89

Regressing Y1, Y2, and Y3, respectively, on X, and Y4 on X4, produces four essentially identical regression equations: Y = 3 + 0.5X. The following plots reveal the pattern of actual data values for each bivariate association around the regression line.

[Figure: four scatterplots with fitted regression lines. Panels: Y1 on X, Y2 on X, Y3 on X, Y4 on X4.]

The plots of Y1 on X, Y2 on X, Y3 on X, and Y4 on X4 reveal strikingly different patterns despite the common regression equation. Notice that the pattern of data for Y2 on X is curvilinear. The pattern of data for Y4 on X4 reveals that the regression line is being pulled upward by one extreme point; without this point, X4 would not vary at all, so there would be no slope to estimate. Hopefully
this example convinces you of the importance of examining the data before blindly accepting the results of a regression analysis. The validity of the regression model can be seriously compromised by violations of its assumptions and by other problems. Today we will discuss the consequences of violating assumptions and of other problematic issues, along with diagnostics that aid in detecting them.

BASIC ASSUMPTIONS OF LEAST SQUARES REGRESSION

When the regression equation is used to estimate population-level relations from sample data, it is important that we satisfy a set of assumptions. The assumptions are necessary for our inference from the sample to the population to be valid. Underlying our regression estimates (i.e., the y-intercept and betas) are hypothetical sampling distributions of all possible estimates (e.g., estimated beta values) from samples of a given size. It is this sampling distribution that enables us to draw inferences from our sample, which provides one possible estimate from the sampling distribution, to the population. Satisfying the assumptions ensures that we know the characteristics of the sampling distribution for our estimates.

When the assumptions are satisfied, the sampling distribution for a given beta is normally distributed, the mean of that sampling distribution is equal to the population parameter, and the variance of all possible sample estimates of beta (i.e., the variance of the sampling distribution) is as small as possible. So, when the assumptions are satisfied, estimates of the regression parameters are unbiased, in the sense that across all possible samples the average sample value equals the population value, and there will be little variation among the sample values relative to other regression techniques. In other words, our sample estimate is unbiased (i.e., accurate on average) and its standard error will be small: our sample provides a decent estimate of the population parameters.

The basic
assumptions of Least Squares Regression are phrased in regard to the residuals (i.e., errors) of the regression model. Those basic assumptions are that the errors (1) are independent, (2) are normally distributed with a mean of zero, and (3) have a constant variance. Let's examine the consequences of violating each assumption and methods of detecting the violation.

Assumption of Independence

The assumption of independence indicates that the errors for different observations are unrelated. Violating the assumption of independence typically results in an underestimate of the standard error of the regression coefficient (e.g., SE_B), which in turn increases the Type I error rate of our significance tests (recall that the t test of a beta is t = B / SE_B). Violations usually arise from the sampling technique.

In cross-sectional studies (i.e., studies in which data are collected at one point in time), sampling persons who share similar genetic or social-structural backgrounds usually violates the independence assumption. For example, if we have husbands and wives complete measures of relationship satisfaction, their responses are typically not independent because they share the relationship in common. The solution is to treat the relationship as the unit of analysis or to use analysis strategies that explicitly model the dependence in the data (e.g., multilevel models).

In longitudinal studies (i.e., studies in which data are collected from the same persons across time), the repeated measurements on the same persons violate the independence assumption. The solution is to use analysis strategies appropriate for time-series data.

Assumption of Normality

The normality assumption indicates that the errors are normally distributed and ensures that the p values for our inferential tests are accurate. The results of statistical tests, however, are typically robust to violations of normality, particularly when sample size is large.

Diagnosing normality

The normality assumption can be investigated by
graphing the residuals. Two such techniques are the stem-and-leaf plot and the probability plot. The stem-and-leaf plot can be constructed from the residuals, standardized residuals, or studentized residuals. The residuals are simply the discrepancies between the observed and predicted values. A standardized residual is a residual divided by the standard error of the estimate (i.e., adjusted by the expected amount of error). A studentized residual is similar to a standardized residual; however, the residual is divided by the
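
The definitions of the raw and standardized residual given above can be made concrete with a short sketch. The Python below is illustrative only and is not part of the original notes: it fits Y3 on X from the Anscombe table, computes the raw residuals, and divides them by the standard error of the estimate to obtain standardized residuals. The helper names (fit_line, standardized_residuals) and the cutoff of 2 used to flag a residual are assumptions chosen for illustration, not a rule stated in these notes.

```python
import math

# Anscombe's X and Y3 columns from the table in these notes.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def standardized_residuals(xs, ys):
    """Return (raw residuals, residuals / standard error of the estimate)."""
    intercept, slope = fit_line(xs, ys)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(xs, ys)]
    # Standard error of the estimate with one predictor: sqrt(SSE / (n - 2)).
    see = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
    return residuals, [e / see for e in residuals]


residuals, z = standardized_residuals(x, y3)
for xi, e, zi in zip(x, residuals, z):
    # The cutoff of 2 is an illustrative rule of thumb (an assumption).
    flag = "  <-- unusually large" if abs(zi) > 2 else ""
    print(f"X={xi:2d}  residual={e:6.2f}  standardized={zi:6.2f}{flag}")
```

Sorting or plotting the standardized values is one informal way to judge whether the residuals look roughly normal; in this data set a single observation (X = 13) stands well apart from the rest.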
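
Returning to the Anscombe example that opens these notes, the claim that the four data sets share one regression equation can be checked directly. The following is a minimal sketch, assuming plain Python with no statistics package; the function name fit_line and the dictionary layout are hypothetical choices made for illustration. It fits each of the four bivariate regressions by ordinary least squares and prints the estimated equation.

```python
# Anscombe's data as listed in the table in these notes.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# One fit per panel of the figure: three outcomes on X, and Y4 on X4.
fits = {
    "Y1 on X": fit_line(x, y1),
    "Y2 on X": fit_line(x, y2),
    "Y3 on X": fit_line(x, y3),
    "Y4 on X4": fit_line(x4, y4),
}

for label, (intercept, slope) in fits.items():
    print(f"{label}: Y = {intercept:.2f} + {slope:.2f}X")
```

Each fit should come out close to Y = 3 + 0.5X, which is exactly why the coefficients alone cannot distinguish the four patterns; the plots and residuals have to be examined as well.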