Course: Multiple Regression
Topic: Regression Diagnostics

REGRESSION DIAGNOSTICS: KNOW THY DATA

Thus far we have discussed the basics of bivariate and multiple regression and can easily compute a linear model that relates a dependent variable to one or more predictor variables. We should not, however, accept the results of a regression analysis without thoroughly examining and understanding our data to ensure that we have neither violated the assumptions of the analysis nor fallen prey to problems that bias its results. The following example, developed by Anscombe (1973, American Statistician, 27, p. 17-21; as described at Gerard Dallal's website: http://www.tufts.edu/~gdallal/anscombe.htm), demonstrates why a regression equation should not be accepted at face value.

Anscombe generated data for six variables (Y1, Y2, Y3, Y4, X, and X4) to demonstrate that very different patterns of data can produce identical regression models. The following table contains the data for the six variables.

  X     Y1     Y2     Y3    X4     Y4
 10   8.04   9.14   7.46     8   6.58
  8   6.95   8.14   6.77     8   5.76
 13   7.58   8.74  12.74     8   7.71
  9   8.81   8.77   7.11     8   8.84
 11   8.33   9.26   7.81     8   8.47
 14   9.96   8.10   8.84     8   7.04
  6   7.24   6.13   6.08     8   5.25
  4   4.26   3.10   5.39    19  12.50
 12  10.84   9.13   8.15     8   5.56
  7   4.82   7.26   6.42     8   7.91
  5   5.68   4.74   5.73     8   6.89

Regressing Y1, Y2, and Y3, respectively, on X, and Y4 on X4, produces the same regression equation in all four cases (to the precision shown): Y = 3.0 + .5X.
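The claim that all four regressions share one equation can be checked directly from the data listed above. The following is a minimal sketch in plain Python (the course itself uses SAS and SPSS, so the language and the helper name `ols_fit` are assumptions made here for illustration). It computes the least squares intercept and slope for each of the four pairings.

```python
# Anscombe's data, copied from the table above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs.

    Hypothetical helper written for this illustration.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


fits = {
    "Y1 on X": ols_fit(x, y1),
    "Y2 on X": ols_fit(x, y2),
    "Y3 on X": ols_fit(x, y3),
    "Y4 on X4": ols_fit(x4, y4),
}

for label, (intercept, slope) in fits.items():
    print(f"{label}: Y = {intercept:.2f} + {slope:.2f}X")
```

The four fits agree once rounded to the precision reported in the text; the unrounded estimates differ slightly in later decimal places.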
The following plots reveal the pattern of actual data values for each bivariate association around the regression line.

[Plots: Y1 on X; Y2 on X; Y3 on X; Y4 on X4]

The plots of Y1 on X, Y2 on X, Y3 on X, and Y4 on X4 reveal strikingly different patterns despite the common regression equation. Notice that the pattern of data for Y2 on X is curvilinear. The pattern of data for Y4 on X4 reveals that the regression line is being pulled upward by one extreme point. Without this point every X4 value equals 8, so there would be no variation in X4 from which to estimate a slope at all. Hopefully this example convinces you of the importance of examining the data before blindly accepting the results of a regression analysis.

The validity of the regression model can be seriously compromised by violations of its assumptions and by other problems. Today we will discuss the consequences of violating assumptions, other problematic issues, and the diagnostics that help detect both.

BASIC ASSUMPTIONS OF LEAST SQUARES REGRESSION

When the regression equation is used to estimate population-level relations from sample data, it is important that we satisfy a set of assumptions. The assumptions are necessary for our inference from the sample to the population to be valid. Underlying each of our regression estimates (i.e., the y-intercept and the betas) is a hypothetical sampling distribution of all possible estimates from samples of a given size. It is this sampling distribution that enables us to draw inferences from our sample (which provides one possible estimate from the sampling distribution) to the population. Satisfying the assumptions ensures that we know the characteristics of the sampling distribution for our estimates.
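The idea of a sampling distribution can be made concrete with a small simulation. The sketch below is an illustration only: the population values (intercept 3.0, slope 0.5, error standard deviation 1), the sample size of 20, the number of replications, and the helper name `ols_slope` are all assumptions chosen for the example, not values from the course. It draws many samples whose errors are independent, normal, and of constant variance, estimates the slope in each, and summarizes the resulting estimates.

```python
import random


def ols_slope(xs, ys):
    """Least squares slope of ys on xs (hypothetical helper for this sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return sxy / sxx


random.seed(1973)  # fixed seed so the run is reproducible

# Assumed population model: Y = 3.0 + 0.5X + error, error ~ Normal(0, 1).
TRUE_INTERCEPT = 3.0
TRUE_SLOPE = 0.5
ERROR_SD = 1.0
xs = list(range(1, 21))  # the same 20 predictor values in every sample

slopes = []
for _ in range(2000):
    # Each sample gets fresh, independent errors.
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, ERROR_SD) for x in xs]
    slopes.append(ols_slope(xs, ys))

# Mean and standard deviation of the simulated sampling distribution.
mean_slope = sum(slopes) / len(slopes)
sd_slope = (sum((s - mean_slope) ** 2 for s in slopes) / (len(slopes) - 1)) ** 0.5

print(f"mean of estimated slopes: {mean_slope:.4f}")
print(f"standard deviation of estimated slopes: {sd_slope:.4f}")
```

Each simulated sample yields one slope estimate. The average of the estimates should sit close to the assumed population slope, and their standard deviation is what the standard error of the slope estimates from a single sample.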
When the assumptions are satisfied, the sampling distribution for a given beta is normally distributed, the mean of the sampling distribution is equal to the population parameter, and the variance of all possible sample estimates of beta (i.e., the variance of the sampling distribution) is as small as possible. So, when the assumptions are satisfied, estimates of the regression parameters are unbiased in the sense that across all possible samples the average sample value is equivalent to the population value, and there will be little variation among the sample values (relative to other regression techniques). In other words, our sample estimate is unbiased (i.e., accurate) and its standard error will be small, so our sample provides a decent estimate of the population parameters.

The basic assumptions of Least Squares Regression are phrased in regard to the residuals (i.e., errors) of the regression model. Those basic assumptions are that the errors (1) are independent, (2) are normally distributed with a mean of zero, and (3) have a constant variance. Let's examine the consequences of violating each assumption and methods of detecting the violation.

Assumption of Independence

The assumption of independence indicates that the errors for each observation are unrelated. Violating the assumption of independence results in a biased underestimation of the standard error of the regression coefficient (e.g., SE_B), which in turn increases the Type I error rate of our significance tests (recall that the t-test of a beta is t = B/SE_B, so a standard error that is too small makes t too large). Violations typically arise from the way the sample is drawn. In cross-sectional studies (i.e., studies in which data are collected at one point in time), sampling persons who share similar genetic or social-structural backgrounds usually violates the independence assumption. For example, if we have husbands and wives complete measures of relationship satisfaction, their responses are typically not independent because they share the relationship in common.
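The effect of dependence on the standard error can be seen in a deliberately extreme sketch. Suppose, purely as an illustrative assumption, that every respondent has a partner who gives exactly the same answers, so that each observation appears twice. The Python code below (the helper name `slope_se_t` is hypothetical, and Anscombe's X and Y1 values are reused only as convenient numbers) fits the regression to the original 11 cases and then to the 22 "paired" cases.

```python
def slope_se_t(xs, ys):
    """Return (B, SE_B, t) for the regression of ys on xs.

    Hypothetical helper for this illustration; t = B / SE_B.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and mean square error on n - 2 degrees of freedom.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    mse = sse / (n - 2)
    se = (mse / sxx) ** 0.5
    return slope, se, slope / se


# Anscombe's X and Y1, used here only as example numbers.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Independent cases: 11 observations.
b, se, t = slope_se_t(x, y1)

# Perfectly dependent pairs: every observation is entered twice.
b_dup, se_dup, t_dup = slope_se_t(x * 2, y1 * 2)

print(f"11 cases: B = {b:.3f}, SE_B = {se:.3f}, t = {t:.2f}")
print(f"22 cases: B = {b_dup:.3f}, SE_B = {se_dup:.3f}, t = {t_dup:.2f}")
```

The duplicated cases add no new information, so B is unchanged, yet the computed SE_B shrinks and t grows because the analysis treats 22 dependent responses as if they were 22 independent ones. Real dependence between partners is weaker than exact duplication, but it pushes the standard error in the same direction.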
The solution is to treat the relationship as the unit of analysis or to use analysis strategies that explicitly model the dependence in the data (e.g., multi-

Stem & Leaf of Studentized Residual

Stem Leaf      #
   1 8         1
   1 11        2
   0
   0 02        2
  -0 100       3
  -0 7         1
  -1
  -1 6         1
  -2 1         1
     ----+----+----+----+