REGRESSION DIAGNOSTICS: KNOW THY DATA

Know Thy Data
- Never blindly accept the results of a regression analysis
- Violations of assumptions and other problematic issues can produce misleading or inaccurate results
- Examine the results in regard to the nature of your data

An Example from Anscombe (1973)

     X     Y1     Y2     Y3    X4     Y4
    10   8.04   9.14   7.46     8   6.58
     8   6.95   8.14   6.77     8   5.76
    13   7.58   8.74  12.74     8   7.71
     9   8.81   8.77   7.11     8   8.84
    11   8.33   9.26   7.81     8   8.47
    14   9.96   8.10   8.84     8   7.04
     6   7.24   6.13   6.08     8   5.25
     4   4.26   3.10   5.39    19  12.50
    12  10.84   9.13   8.15     8   5.56
     7   4.82   7.26   6.42     8   7.91
     5   5.68   4.74   5.73     8   6.89

- Different patterns of data can produce the same regression
- Regressing Y1, Y2, or Y3 on X, or Y4 on X4, produces the same model: Y = 0.5X + 3.0

[Figures: "Data and Regression for Y1 = X" (Y1 = 0.5X + 3.0); "Data and Regression for Y2 = X" (Y2 = 0.5X + 3.0); "Data and Regression for Y3 = X" (Y3 = 0.5X + 3.0); "Data and Regression for Y4 = X4" (Y4 = 0.5X4 + 3.0)]

Organization of Lecture
- Diagnosing & consequences of assumption violations
- Diagnosing & consequences of problematic issues

Assumptions of Ordinary Least Squares Regression
The errors are:
(1) Independent
(2) Normally distributed with a mean of zero
(3) Of constant variance

If Assumptions Are Satisfied
- Sample-based regression coefficients (betas & Y-intercept) have an underlying sampling distribution
  - e.g., the value of b1 is one possible sample estimate from a distribution of all possible sample estimates of the true population value B1
- The sampling distribution enables statistical inference
- When the assumptions are satisfied, the sampling distribution for a given coefficient is:
  - normally distributed
  - centered on the population parameter (the mean of the sampling distribution equals the parameter)
  - minimal in variability

In Other Words
- Our sample
coefficients will provide decent estimates of the population parameters because:
  - the sample estimate (e.g., b1) is unbiased: on average (across all possible samples) the sample estimate will equal the population parameter
  - the standard error of the regression estimate is minimized relative to other forms of regression

Assumption of Independence
- The errors for the observations are unrelated
- Violating this assumption UNDERESTIMATES the SEb, which inflates the Type I error rate (i.e., t = b/SEb)
- Violations arise from sampling methods:
  - observations that share genetic or contextual influences
  - data from longitudinal (i.e., repeated-measures) designs
- Solution: use techniques that model the dependence, e.g., multilevel modeling or time-series analysis

Assumption of Normality
- The errors are normally distributed
- Violating normality biases the accuracy of the p-values of statistical tests: non-normal errors do not fit the t-distribution
- Large sample sizes are robust to violation
- Normality can be diagnosed with graphical methods, e.g., a stem-and-leaf plot of the residuals (of which there are actually three types)

3 Types of Residuals
- Raw residual: error = Y - Ŷ
  - keyword is "r" in SAS
- Standardized residual: the error divided by the standard error of the estimate, which adjusts the residual for the expected amount of error
  - keyword is "student" in SAS
- Studentized residual: the residual adjusted by a standard error of estimate that is computed with the given observation deleted
  - keyword is "rstudent" in SAS

Stem & Leaf Plot for Normality
- A histogram of the residuals turned on its side
- Each residual is divided into two pieces:
  (1) the value to the left of the decimal forms the stem (or Y-axis)
  (2) the value to the right of the decimal forms the leaf
  - e.g., a residual of 1.23 provides a stem of 1 and a leaf of 2
- Stem values are listed twice: once to account for leaves of 0-4 and again for leaves of 5-9
- E.g., residuals -1.68, -1.12, 0.01, 0.11, 0.52, 0.63, 0.77, 1.23, 1.35, 1.54 give:

    Stem  Leaf
      1   5
      1   23
      0   567
      0   01
     -1   1
     -1   6

Stem & Leaf Plot for Normality
- The distribution of leaves
should be normal (i.e., bell-shaped)
- Deviations from the bell shape suggest non-normality

"Normal" Stem & Leaf (of studentized residuals)

    Stem  Leaf
      1   8
      1   11
      0
      0   02
     -0   100
     -0   7
     -1
     -1   6
     -2   1

"Negatively Skewed" Stem & Leaf (of studentized residuals)

    Stem  Leaf
      1   677889
      1   1123
      0
      0   02
     -0   1
     -0   7
     -1
     -1   6
     -2   1

Obtaining the Stem-and-Leaf in SAS
Two steps:
(1) output the residuals (raw, standardized, or studentized) from PROC REG
(2) input the residuals into PROC UNIVARIATE and use the PLOT option

    proc reg;
      model y1 = x;
      output out=residy1
             r=rsy1            /* raw residual          */
             student=stndry1   /* standardized residual */
             rstudent=sry1;    /* studentized residual  */
    run;

    data temp;
      set residy1;
    run;

    proc univariate plot;
      var rsy1 stndry1 sry1;
    run;

Correcting Violations of Normality
- Transform the DV with a square or square root
  - The square root of the DV for a positively skewed distribution "pulls in" the right tail: the few large values in the positive tail are pulled in
  - Squaring the DV for a negatively skewed distribution "stretches out" the right tail: large values bunched in the positive tail are stretched out
  - If the DV has negative values, a constant should be added to each score before squaring to maintain the order of the data
    - e.g., (-4)² = 16 becomes larger than (-2)² = 4, reversing the original order
- Problem: the results of the regression are in regard to the transformed DV. Is the transformed DV (e.g., self-esteem²) meaningful?

Assumption of Constant Variance
- Homoscedasticity: the variance of the errors is constant across values of the predictor variable(s)
- Heteroscedasticity: the variance of the errors changes across values of the predictor variable(s)

[Figures: "Homoscedasticity" (left) and "Heteroscedasticity" (right)]
- In the figure on the left, the distribution of error around the regression line has the same variance at each value of X
- In the figure on the right, the variance of the distribution of error around the regression line increases across values of X

Consequence of Heteroscedasticity
- Does not affect the unbiasedness of the regression coefficients
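The stem-and-leaf rule described above (stem = the value left of the decimal, leaf = the first digit right of the decimal, each stem listed twice for leaves 0-4 and 5-9) can also be sketched outside SAS. The following Python function is only an illustration of that rule, not part of the lecture's SAS workflow; the function name and output formatting are my own:

```python
def stem_and_leaf(residuals):
    """Build stem-and-leaf rows: stem = integer part of the residual,
    leaf = first decimal digit; each stem may appear twice
    (one row for leaves 0-4, one for leaves 5-9)."""
    rows = {}   # (stem_label, half) -> list of leaf digits
    order = {}  # (stem_label, half) -> value used to sort rows top-down
    for r in residuals:
        negative = r < 0
        whole = int(abs(r))                         # stem magnitude
        leaf = int(round(abs(r) * 100)) // 10 % 10  # first decimal digit
        half = 1 if leaf >= 5 else 0                # 0 = leaves 0-4, 1 = leaves 5-9
        stem = ("-" if negative else "") + str(whole)  # preserves "-0" stems
        key = (stem, half)
        rows.setdefault(key, []).append(leaf)
        rep = whole + 0.5 * half + 0.25             # representative row value
        order[key] = -rep if negative else rep      # largest values printed first
    lines = []
    for key in sorted(rows, key=lambda k: order[k], reverse=True):
        stem, _ = key
        leaves = "".join(str(d) for d in sorted(rows[key]))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# The ten residuals from the slide's worked example.
residuals = [-1.68, -1.12, 0.01, 0.11, 0.52, 0.63, 0.77, 1.23, 1.35, 1.54]
for line in stem_and_leaf(residuals):
    print(line)
```

Run on the slide's ten residuals, this reproduces the plot shown above (stem 1 with leaf 5, stem 1 with leaves 2 and 3, and so on down to stem -1 with leaf 6). In practice PROC UNIVARIATE builds this plot for you; the sketch only makes the splitting rule concrete.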
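As a closing check, the Anscombe (1973) claim above, that all four datasets yield the same model Y = 0.5X + 3.0, can be verified directly from the tabulated data. This is a sketch in Python rather than the lecture's SAS, using the standard closed-form least-squares formulas b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄:

```python
def least_squares(x, y):
    """Simple linear regression: return (slope, intercept) from the
    usual closed-form least-squares formulas."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return b1, my - b1 * mx

# Anscombe's quartet, exactly as tabulated in the lecture.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Each regression gives (to two decimals) slope 0.5 and intercept 3.0.
for xs, ys in [(x, y1), (x, y2), (x, y3), (x4, y4)]:
    b1, b0 = least_squares(xs, ys)
    print(round(b1, 2), round(b0, 2))  # prints: 0.5 3.0
```

Identical coefficients, yet the four scatterplots look nothing alike, which is exactly why the lecture insists on examining the data and residuals rather than the fitted equation alone.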