REGRESSION DIAGNOSTICS: KNOW THY DATA

Know Thy Data
- Never blindly accept the results of a regression analysis
- Violation of assumptions and other problematic issues can produce misleading or potentially inaccurate results
- Examine the results in regard to the nature of your data

An Example from Anscombe (1973)
- Different patterns of data can produce the same regression

     X      Y1      Y2      Y3      X4      Y4
    10    8.04    9.14    7.46       8    6.58
     8    6.95    8.14    6.77       8    5.76
    13    7.58    8.74   12.74       8    7.71
     9    8.81    8.77    7.11       8    8.84
    11    8.33    9.26    7.81       8    8.47
    14    9.96    8.10    8.84       8    7.04
     6    7.24    6.13    6.08       8    5.25
     4    4.26    3.10    5.39      19   12.50
    12   10.84    9.13    8.15       8    5.56
     7    4.82    7.26    6.42       8    7.91
     5    5.68    4.74    5.73       8    6.89

[Figure: four scatterplots: Y1 vs. X, Y2 vs. X, Y3 vs. X, Y4 vs. X4]
- All produce the same model: Y = 0.5X + 3.0

Data and Regression for Y vs. X
- Y = 0.5X + 3.0

Data and Regression for Y2 vs. X
- Y2 = 0.5X + 3.0

Data and Regression for Y3 vs. X
- Y3 = 0.5X + 3.0

Data and Regression for Y4 vs. X4
- Y4 = 0.5X4 + 3.0

Organization of Lecture
- Diagnosing consequences of assumption violations
- Diagnosing consequences of problematic issues

Assumptions of Ordinary Least Squares Regression
The errors:
1. are independent
2. are normally distributed with a mean of zero
3. have a constant variance

If Assumptions are Satisfied
- Sample-based regression coefficients (betas, Y-intercept) have an underlying sampling distribution
  - e.g., the value of b1 is one possible sample estimate from a distribution of all possible sample estimates of the true population value B1
- The sampling distribution enables statistical inference
- When assumptions are satisfied, the sampling distribution for a given coefficient:
  - is normally distributed
  - has a mean equal to the population parameter
  - has minimized variability

In other words...
- Our sample coefficients will provide decent estimates of population parameters because:
  - The sample estimate (e.g., b1) is unbiased: on average across all possible samples, the sample estimate will equal the population parameter
  - The standard error of the regression estimate will be minimized relative to other forms of regression

Assumption of Independence
- Errors for each observation are unrelated
- Violating this assumption UNDERESTIMATES the SEB, which increases the Type I error rate (i.e., t = B / SEB, so a too-small SEB inflates t)
- Violation arises from:
  - sampling methods: observations that share genetic or contextual influences
  - data from longitudinal (i.e., repeated measures) methods
- Solution: use techniques that model the dependence, e.g., multilevel modeling or time series analyses

Assumption of Normality
- Errors are normally distributed
- Violation of normality biases the accuracy of the p-values of statistical tests: non-normal errors do not fit the t distribution
- Large sample sizes are robust to violation
- Can diagnose normality with graphical methods: a stem-and-leaf plot of residuals (actually, 3 types of residuals)

3 Types of Residuals
- Raw residual: error (Y − Ŷ)
  - keyword is r in SAS
- Standardized residual: error divided by the standard error of the estimate
  - adjusts the residual for the expected amount of error
  - keyword is student in SAS
- Studentized residual: residual adjusted by a standard error of estimate that is computed with the given observation deleted
  - keyword is rstudent in SAS

Stem & Leaf Plot For Normality
- A histogram of residuals that is turned on its side
- Each residual is divided into two pieces:
  1. the value to the left of the decimal forms the stem (the Y axis)
  2. the value to the right of the decimal forms the leaf
  - e.g., a residual of 1.23 provides a stem of 1 and a leaf of 2
- Stem values are listed twice: once to account for leaves of 0-4 and again for leaves of 5-9
- E.g., Residuals: −1.68, −1.12, 0.01, 0.11, 0.52, 0.63, 0.77, 1.23, 1.35, 1.54

    Stem | Leaf
       1 | 5
       1 | 23
       0 | 567
       0 | 01
      -1 | 1
      -1 | 6

Stem & Leaf Plot For Normality
- The distribution of leaves should be normal, i.e., bell-shaped
- Deviations from a bell shape suggest non-normality

Normal Stem & Leaf
Stem & Leaf of Studentized Residual
    Stem: 1 1 0 0 0 0 1 1 2
    Leaf: 8, 11, 02, 100, 7, 6, 1

Negatively Skewed Stem & Leaf
Stem & Leaf of Studentized Residual
    Stem: 1 1 0 0 0 0 1 1 2
    Leaf: 677889, 1123, 02, 1, 7, 6, 1

Obtaining Stem and Leaf In SAS
2 Steps:
1. output the residuals (raw, standardized, or studentized) from proc reg
2. input the residuals into proc univariate and use the plot option

    /* r = raw residual, student = standardized residual, */
    /* rstudent = studentized residual                     */
    proc reg;
      model y1 = x;
      output out=residy1 r=rsy1 student=stndry1 rstudent=sry1;
    run;

    data temp;
      set residy1;
    run;

    proc univariate plot;
      var rsy1 stndry1 sry1;
    run;

Correcting Violations of Normality
- Transform the DV with a square or square root
  - √DV for a positively skewed distribution: pulls in the right tail (the few large values in the positive tail are pulled in)
  - DV² for a negatively skewed distribution: stretches out the right tail (large values bunched in the positive tail are stretched out)
  - if the DV has negative values, a constant should be added to each score before squaring to maintain the order of the data (e.g., (−4)² becomes larger than (−2)²)
- Problem: results of the regression are in regard to the transformed DV
  - Is the transformed DV (e.g., self-esteem²) meaningful?

Assumption of Constant Variance
- Homoscedasticity: the variance of the errors is constant across values of the predictor variable(s)
- Heteroscedasticity: the variance of the errors changes across values of the predictor variable(s)

Homoscedasticity / Heteroscedasticity
- Figure on left: the distribution of error around the regression line has the same variance at each value of X
- Figure on right: the variance of the distribution of error around the regression line increases across values of X

Consequence of Heteroscedasticity
- Does not affect the unbiasedness of the regression coefficient
- Does affect the standard error of the regression coefficient:
  - when error variance increases across the predictor, the SEB is underestimated
  - when error variance decreases across the predictor, the SEB is overestimated
- Biases the accuracy of the p-value such that the Type I error rate will not equal α (i.e., t = B / SEB)

Diagnosing Heteroscedasticity
- Plot residuals or studentized residuals against the predicted values
- Because the predicted values are the regression equation, we would expect the distribution of residuals to have the same variation across the predicted values

Homoscedastic
- Distance of residuals from the reference line (i.e., 0 residual: actual value − predicted value) is relatively constant across predicted values

Heteroscedastic
- Distance of residuals from the reference line increases across predicted values
  - residuals "fan out" across predicted values

Detecting Heteroscedasticity in SAS
- Use proc reg to plot residuals by predicted values

    proc reg;
      model y1 = x;
      plot r.*p. rstudent.*p.;
    run;

Correcting