Course: Multiple Regression
Topic: Regression Diagnostics

ANALYZING SETS OF VARIABLES

Thus far, we have used multiple regression to partial from variables the effects of other variables. In our salary data, for example, we were able to partial from publications its shared variability with years since PhD to examine the unique effect of publications on salary. In this lecture, we will extend the basic concepts of multiple regression to the analysis of variable sets.

In general, a set is a collection of related variables. For example, imagine we have various demographic variables (age, sex, SES, etc.) and measures of achievement/ability (GPA, SAT, GRE, number of Olympic gold medals, etc.). In addition to examining whether each of these variables (age, sex, SAT, gold medals, etc.) uniquely predicts, say, popularity, we could examine whether the achievement variables as a group (i.e., as a set) predict popularity independent of (i.e., unique from) the group (i.e., set) of demographic variables.

Before elaborating on the meaning of a set of variables and extending concepts of multiple regression to the analysis of sets, it would be helpful to further develop our understanding of the strategies of simultaneous and hierarchical regression analysis.

SIMULTANEOUS AND HIERARCHICAL ANALYSIS OF VARIABLES

We previously discussed, briefly, the distinction between the simultaneous and hierarchical approaches to regression analysis. In the simultaneous approach, all of the predictor variables are entered into a single regression model in which the dependent variable is simultaneously regressed onto the predictor variables. In our academic salary data, for example, salary is simultaneously regressed onto PhD, Publications, and Citations:

    Salary = PhD + Publications + Citations

Such a model produces regression coefficients, semi-partial correlations, and partial correlations in which each predictor is fully partialled from all of the other predictors in the model. The squared semi-partial correlations from this model, however, are not additive. That
is, the squared semi-partial correlations of the predictors do not sum to the R² of the model.

In the hierarchical approach, the predictor variables are entered sequentially into a series of regression models. For example, we can sequentially add PhD, Publications, and Citations to models predicting salary as follows:

    Model 1: Salary = PhD
    Model 2: Salary = PhD + Publications
    Model 3: Salary = PhD + Publications + Citations

Notice that the final model of the hierarchical approach is identical to the single model used in the simultaneous approach. Consequently, the partial betas in this third model are identical to the partial betas in the simultaneous approach.

The benefit of the hierarchical approach, however, is that it can be used to determine the percentage of variation in the dependent variable that is accounted for by each of the predictor variables. In other words, it can be used to generate squared semi-partial correlations for each predictor that sum to the R² of the fullest model (e.g., Model 3). In particular, the difference between the R² values of successive models is equivalent to the percentage of variability in the dependent variable that is uniquely associated with the most recently added variable, over and above the variables entered in earlier models (i.e., the successive squared semi-partial correlation). The caveat to keep in mind is that the unique partitioning of the total R² is specific to the order in which the variables are entered across models. A different partitioning of the R² would be obtained if Citations were added in Model 2 as opposed to being added in Model 3.

Let's look at some examples to make these abstract concepts a bit more concrete.

Simultaneous Analysis of the Salary Data

The following table contains the SAS code for simultaneously regressing salary onto PhD, Publications, and Citations.

SAS Code for Simultaneous Analysis

    proc reg;
      model salary = phd pubs citations / scorr2;
    run;

SPSS Code for Simultaneous Analysis

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA ZPP
      /DEPENDENT salary
      /METHOD=ENTER phd pubs citations.

Note: The statements in the STATISTICS subcommand are used to request the various tables provided in the SAS output.

The following tables contain the output.

SPSS Output for Simultaneous Regression

Model Summary

    Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
    1       .670a   .449       .298                5286.553

    a. Predictors: (Constant), citations, pubs, phd

ANOVA(b)

    Model 1       Sum of Squares    df   Mean Square     F       Sig.
    Regression    250096273.055      3   83365424.352    2.983   .078a
    Residual      307424052.279     11   27947641.116
    Total         557520325.334     14

    a. Predictors: (Constant), citations, pubs, phd
    b. Dependent Variable: salary

Coefficients(a)

                  Unstandardized            Standardized                     Correlations
    Model 1       B           Std. Error    Beta           t       Sig.     Zero-order   Partial   Part
    (Constant)    19796.651   2567.958                     7.709   .000
    phd             363.910    277.580      .433           1.311   .217     .618         .368      .294
    pubs             99.389    376.702      .081            .264   .797     .461         .079      .059
    citations      1073.783    954.709      .284           1.125   .285     .507         .321      .252

    a. Dependent Variable: salary

The R² indicates that the model explains roughly 45% of the variability in salary. Despite this relatively impressive R², none of the partial betas for the predictors is significant; this reflects the low power afforded by our small sample of 15 observations. In this simultaneous regression, each predictor has partialled from it the effects of the other predictors. Consistent with our previous discussion, notice that the squared semi-partial (Part) correlations do not sum to the total R² (i.e., .086 + .003 + .063 ≠ .45). Finally, we should note that the F test of the model indicates that our model with PhD, Publications, and Citations does not predict significantly better than a model without PhD, Publications, and Citations, F(3, 11) = 2.98, p = .0779; again, this null effect can be attributed to the low power afforded by the small sample size.

Because we are going to utilize the concept of model comparison throughout the lecture, it will be beneficial to examine the R² version of the F test, which was
presented in a previous lecture. That is, the F test in the above output directly compares the R² of the current model with the R² of a reduced model that excludes the predictors contained in the current model. The formula for the test can be stated as

$$F = \frac{(R^2_{Full} - R^2_{Restricted}) / (df_R - df_F)}{(1 - R^2_{Full}) / df_F}$$

where df_R and df_F refer to the error degrees of freedom (df_error in SAS) for the restricted and full models, respectively. An algebraically equivalent and more convenient expression of the formula is …
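As a rough numerical check on the R² form of the F test stated above, the following sketch recomputes the overall F for the simultaneous model from the sums of squares in the ANOVA table. It is written in Python rather than SAS or SPSS purely for illustration. The function name f_from_r2 is hypothetical, and the restricted model is assumed to be the intercept-only model (R² = 0, error df = 14). The last lines square the Part (semi-partial) correlations from the Coefficients table to show that they fall well short of the model R².

```python
def f_from_r2(r2_full, r2_restricted, df_restricted, df_full):
    """R-squared form of the model-comparison F test.

    F = ((R2_full - R2_restricted) / (df_R - df_F)) / ((1 - R2_full) / df_F),
    where df_R and df_F are the error degrees of freedom of the
    restricted and full models.
    """
    numerator = (r2_full - r2_restricted) / (df_restricted - df_full)
    denominator = (1.0 - r2_full) / df_full
    return numerator / denominator


# Sums of squares copied from the ANOVA table in the output.
ss_regression = 250096273.055
ss_total = 557520325.334

# R-squared of the full model (PhD, Publications, Citations).
r2_full = ss_regression / ss_total

# Assumption: the restricted model contains only the intercept,
# so its R-squared is 0 and its error df is N - 1 = 14.
f_overall = f_from_r2(r2_full, 0.0, df_restricted=14, df_full=11)

# Part (semi-partial) correlations from the Coefficients table.
part_correlations = {"phd": 0.294, "pubs": 0.059, "citations": 0.252}
sum_sq_part = sum(r ** 2 for r in part_correlations.values())

print(f"R2 (full model)           = {r2_full:.3f}")
print(f"F(3, 11)                  = {f_overall:.3f}")
print(f"Sum of squared part corrs = {sum_sq_part:.3f}")
```

With the restricted R² set to 0, the result should agree with the F of about 2.98 reported in the ANOVA table. The same function can be reused for any pair of nested models by supplying their R² values and error degrees of freedom.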