Course: Multiple Regression
Topic: Correlation & Regression With Multiple Variables

CORRELATION & REGRESSION WITH MULTIPLE VARIABLES

We previously discussed the bivariate case of correlation and regression, in which we examined the association between two variables and estimated a model linking one variable to the other. Today we will begin our exploration of multiple regression and correlation (MRC), in which we account for the shared associations among multiple variables. As you will discover, bivariate correlation and regression can provide a deceptive or inaccurate picture of the actual association between two variables.

The academic salary data illustrate the potentially deceptive lens of bivariate associations. The following table contains the bivariate, or zero-order, correlations among all of the variables in the academic salary data set. These correlations can be obtained in SAS by specifying:

    proc corr;
      var salary phd pubs sex citations;
    run;

    Variable     Salary   PhD    Pubs    Sex    Citations
    Salary         --     .62    .46    -.26      .51
    PhD                   --     .68     .15      .46
    Pubs                         --      .05      .30
    Sex                                  --       .01
    Citations                                     --

    Note: Sex is coded such that 0 = male and 1 = female.

If we focus only on the correlations in the Salary row, we might conclude that salary increases with years since earning the PhD, number of publications, and number of citations, and that females have lower salaries than males. However, an examination of the remaining rows might cast doubt on our initial conclusions. Notice, for example, that years since earning the PhD and publications share a bivariate association (r = .68). Consequently, some of the association between years since PhD and salary might actually be attributable to number of publications. Likewise, notice that publications and citations are also correlated (r = .30). Consequently, some of the correlation between salary and publications may contain information about the association between salary and citations. Furthermore, there may be some unmeasured variable that is related to salary and the other variables, and the
association among salary and the other variables might really be attributable to that unmeasured variable.

This example demonstrates that bivariate correlations do not always reveal the true association between variables, because the bivariate correlation does not remove the influence of other variables. To accurately assess the association between X and Y, we need to remove from the bivariate association the influence of all other variables that are associated with both X and Y. Obviously this is a formidable task, and it is precisely what MRC helps us accomplish. Keep in mind, however, that MRC can suffer the same weakness as bivariate correlation and regression: when important variables are excluded from an MRC analysis, the associations revealed by the analysis have not been purified of the shared associations with the excluded variables. The only true panacea for the problems of correlated variables and unmeasured influences is gathering data with the experimental method. Unfortunately, many issues of study are not amenable to an experimental design (e.g., we can't randomly assign persons to be male or female). What follows are the basics of MRC.

MULTIPLE REGRESSION

Multiple regression provides a linear model of the association between a dependent (or criterion) variable and multiple independent (or predictor) variables. The advantage of multiple regression is that it unconfounds the effect of each predictor variable on the dependent variable from the effects of the other predictor variables in the model. The basic notation for the linear model with two predictors is:

    Y' = A_Y.12 + B_Y1.2(X1) + B_Y2.1(X2)

The betas (Bs) in multiple regression are partial betas and reflect the effect of each variable on Y, controlling for the effects of the other variables. That is, B_Y1.2 reflects the effect of X1 on Y controlling for the effect of X2. Likewise, B_Y2.1 reflects the effect of X2 on Y controlling for the effect of X1. The betas are referred to as
partial betas because each beta has had partialled from it the correlated effects of the other variables in the model. The Y-intercept, A_Y.12, indicates the predicted value of Y when all of the Xs are equal to zero (i.e., the value of Y at which the regression line crosses the Y-axis). Keeping in mind that the betas are partial regression coefficients, we can simplify the notation of the model as follows:

    Y' = A + B1(X1) + B2(X2)

We'll use the salary example to demonstrate the difference between bivariate and multiple regression. The following bivariate regression models reflect the bivariate associations of salary with publications and with citations, respectively:

    Salary' = 21,106 + 566(Pubs)
    Salary' = 22,976 + 1,918(Citations)

The first model indicates that salary increases $566 with each publication. The second model indicates that salary increases $1,918 each time a publication is cited. Given that publications and citations are correlated (r = .30; see the table of bivariate correlations), the beta linking salary to publications may convey some of the effect of citations on salary, and vice versa for the beta linking salary to citations. Multiple regression, however, will remove the overlapping information from each beta and reveal the effect of publications on salary that is independent of citations, and the effect of citations on salary that is independent of publications. For example, the multiple regression model in which salary is simultaneously regressed on publications and citations is:

    Salary' = 20,285 + 418(Pubs) + 1,536(Citations)

Notice that the partial betas for publications and citations are smaller than their respective bivariate betas. (Partial betas can at times be larger than the bivariate betas; we'll talk about such a suppression effect later.) This model indicates that salary increases $418 per publication and $1,536 per citation.

Estimating the Partial Regression Parameters of a Multiple Regression in SAS

The formulas for estimating partial regression parameters are derived from differential calculus and become
exponentially more complicated as additional predictors are added to the regression model. The formulas for the relatively easy case in which there are only two predictor variables are:

    B_Y1.2 = [(r_Y1 - r_Y2 * r_12) / (1 - r_12^2)] * (s_Y / s_X1)

and

    B_Y2.1 = [(r_Y2 - r_Y1 * r_12) / (1 - r_12^2)] * (s_Y / s_X2)

where r_Y1 represents the correlation between Y and X1, r_Y2 represents the correlation between Y and X2, r_12 represents the correlation between X1 and X2, and s represents the standard deviation of the subscripted variable.
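As a quick numeric check, the two-predictor formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original notes; the function name is made up, and the standard deviations are set to 1 for simplicity so that the betas are in standardized form.

```python
# Sketch (not from the notes): partial regression slopes for the
# two-predictor case, computed from zero-order correlations and
# standard deviations using the formulas above.

def partial_betas(r_y1, r_y2, r_12, s_y, s_x1, s_x2):
    """Return (B_Y1.2, B_Y2.1) for Y' = A + B1*X1 + B2*X2."""
    denom = 1.0 - r_12 ** 2  # variance not shared between X1 and X2
    b_y1_2 = (r_y1 - r_y2 * r_12) / denom * (s_y / s_x1)
    b_y2_1 = (r_y2 - r_y1 * r_12) / denom * (s_y / s_x2)
    return b_y1_2, b_y2_1

# Correlations from the salary table: r(salary, pubs) = .46,
# r(salary, citations) = .51, r(pubs, citations) = .30.
b1, b2 = partial_betas(0.46, 0.51, 0.30, 1.0, 1.0, 1.0)
print(round(b1, 3), round(b2, 3))  # 0.337 0.409
```

Note that each partial beta comes out smaller than the corresponding zero-order correlation (.46 and .51), and that when r_12 = 0 the formulas collapse to the familiar bivariate slope r * (s_Y / s_X).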
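To see the unconfounding end to end, here is a self-contained sketch with made-up numbers (hypothetical data, not the academic salary data set). It fits a bivariate model and a two-predictor model by ordinary least squares and shows the bivariate slope absorbing part of the second predictor's effect, just as the salary betas shrank from 566 and 1,918 to 418 and 1,536.

```python
# Hypothetical mini-example: a bivariate slope versus the partial
# slopes from a two-predictor least-squares fit.

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def bivariate_slope(x, y):
    """Slope of the bivariate regression of y on x."""
    return cov(x, y) / cov(x, x)

def two_predictor_fit(x1, x2, y):
    """Solve the normal equations for Y' = A + B1*X1 + B2*X2."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2

# x1 and x2 are correlated, and y was constructed as y = 2*x1 + 1*x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [4, 5, 10, 11, 16, 17]

print(bivariate_slope(x1, y))        # about 2.83: inflated by x2
print(two_predictor_fit(x1, x2, y))  # partial slopes near 2 and 1
```

Because x2 rises with x1, the bivariate slope of y on x1 (about 2.83) carries part of x2's effect; the partial slope recovers the value (2) actually used to build y.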