Course: Multiple Regression    Topic: Regression Diagnostics

REPRESENTING NOMINAL PREDICTOR VARIABLES

As we have previously discussed, MRC is flexible in that it can be used to examine the effects of multiple predictor variables on a dependent variable. Those predictor variables need not be measured on quantitative scales. Some or all of them can be measured on a nominal scale, in which the levels differ qualitatively and not necessarily quantitatively. For example, we can compare males and females, different countries, or different conditions of an experiment on some dependent measure. Typically, differences among the levels of a nominal variable are assessed with analysis of variance (ANOVA). However, given that ANOVA can be conceptualized using the general linear model, and as comparisons among linear models, it should come as no surprise that ANOVA is a special case of regression. Consequently, we can analyze nominal variables with the linear model of regression.

The trick to analyzing a nominal variable in regression is to assign numeric values to represent the levels of the nominal variable. Recall that the regression parameters (i.e., B) represent the amount by which the dependent variable changes per unit change in the predictor variable. Because regression equations don't understand names (e.g., males, females, drug, or placebo), we need to represent the names with numeric values so that the regression equation can estimate the amount by which the dependent variable changes with shifts from one level of the nominal variable (e.g., male) to another level (e.g., female).

Furthermore, when the nominal variable has more than two levels, multiple comparisons among the levels are necessary to extract all of the information from the variable. In particular, a variable with g levels requires g - 1 comparisons. Each of the g - 1 comparisons is represented by a separate predictor variable in the regression analysis. For example, a two-level variable (e.g., sex) can be completely explained by one comparison (e.g., male versus female) and requires a single predictor variable. A three-level variable (e.g., Drug A, Drug B, Drug C), however, can be completely explained by two orthogonal comparisons and requires two predictor variables in the regression equation.

There are three systems for numerically coding the levels of a nominal variable that produce meaningful and directly interpretable regression parameters, that is, parameters that estimate differences among the levels of the nominal variable. The three coding systems (dummy coding, effects coding, and contrast coding) produce the same results for the overall model (i.e., R² and F value). Consequently, when the g - 1 predictors for the g-level nominal variable are treated as a set to represent the nominal variable, the three coding systems result in the same conclusion regarding the overall effect of the nominal variable. The coding systems, however, test different hypotheses, or address different questions, about comparisons among the levels of the nominal variable. Consequently, the three coding systems produce different estimated regression parameters (i.e., B) and different statistical tests of those parameters (t and p values) for the g - 1 predictors.

In the special case in which the nominal variable has only two levels (e.g., male and female), the effects coding and contrast coding systems are identical. Furthermore, the p value for the test of the regression parameter generated by the latter two systems will be equal to the p value for the test generated by the dummy coding system. The estimated parameter value (i.e., B) will differ, however, because the dummy coding system represents the difference between the two levels of the nominal variable differently than does the effects/contrast coding system.

A DATA SET

To facilitate our exploration of the three coding systems, we will use the following hypothetical example and data. Imagine we are interested in the efficacy of two forms of therapy, smiling and exercise, for treating depression. We randomly assign depressed patients to receive smiling therapy, exercise therapy, or no therapy, and several weeks later measure their depression level (assume the scale ranges from 1 to 8, with higher numbers indicating more severe depression). The bogus data are as follows:

Condition           Scores            n    x̄       s
No Therapy          7, 7, 6, 6        4    6.50    0.58
Smiling Therapy     2, 1, 1, 2, 2     5    1.60    0.55
Exercise Therapy    3, 1, 2           3    2.00    1.00

The data were constructed with unequal sample sizes to demonstrate that the regression analysis of the nominal variable can be conducted regardless of whether there are an equal number of observations in the levels of the nominal variable. Although the unequal sample sizes do not pose a problem for the analysis, we should keep in mind that the reason for the unequal samples could pose a threat to our ability to draw a causal inference from the results. Imagine, for example, that the therapies have no effect on depression. Imagine further that the exercise therapy is too difficult for highly depressed persons and causes them to prematurely terminate their participation. The resulting low depression mean for the exercise condition would be an artifact of the differential attrition of highly depressed persons across conditions, rather than an indication of the efficacy of the exercise therapy. For the current purpose of examining the coding systems, however, let's assume that subject attrition is randomly distributed throughout the conditions.

Finally, to reiterate that ANOVA and regression are really one and the same, we can compare the results of our various coding systems to the results of a one-factor ANOVA. Entering the above data into a one-factor ANOVA reveals a significant omnibus effect for the therapy factor, F(2, 9) = 64.79, p < .0001. Two orthogonal contrasts indicate that depression is greater in the no-therapy condition than in the mean of the smiling and exercise therapy conditions, F(1, 9) = 123.48, p < .0001, and that the latter two conditions do not differ in mean level of depression, F(1, 9) = 0.64, p = .4433. Now let's examine the coding systems.

DUMMY CODING

The dummy coding system treats one of the g levels of the nominal variable as a reference group and creates g - 1 predictors that compare the mean of the reference level with the mean of each of the other levels. In our therapy data, for example, we would need two predictor variables to account for the effect of therapy. With dummy coding, we could treat the no-therapy condition as the reference
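As a rough illustration of the dummy coding idea, the sketch below codes the therapy data with the no-therapy condition as the reference level and fits the regression by ordinary least squares. It assumes NumPy is available; the variable names (d_smiling, d_exercise, and so on) are illustrative and do not come from the course materials. Under this coding the intercept should equal the no-therapy mean, each B should equal the difference between a therapy mean and the no-therapy mean, and the overall F should match the one-factor ANOVA.

```python
import numpy as np

# Depression scores from the hypothetical therapy study.
groups = {
    "none": [7, 7, 6, 6],
    "smiling": [2, 1, 1, 2, 2],
    "exercise": [3, 1, 2],
}

y = np.array([s for scores in groups.values() for s in scores], dtype=float)
labels = np.array([name for name, scores in groups.items() for _ in scores])

# Dummy codes: the reference level ("none") scores 0 on both predictors.
d_smiling = (labels == "smiling").astype(float)
d_exercise = (labels == "exercise").astype(float)
X = np.column_stack([np.ones_like(y), d_smiling, d_exercise])

# Ordinary least squares estimates: [intercept, B_smiling, B_exercise].
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Overall model fit: R^2 and the omnibus F test for the set of predictors.
residuals = y - X @ b
ss_error = float(residuals @ residuals)
ss_total = float(((y - y.mean()) ** 2).sum())
r_squared = 1 - ss_error / ss_total

df_model = X.shape[1] - 1          # g - 1 predictors
df_error = len(y) - X.shape[1]     # N - g
f_value = ((ss_total - ss_error) / df_model) / (ss_error / df_error)

print("intercept, B_smiling, B_exercise:", np.round(b, 2))
print("R^2 = %.3f, F(%d, %d) = %.2f" % (r_squared, df_model, df_error, f_value))
```

With these data the intercept is the no-therapy mean (6.50) and the two B values are 1.60 - 6.50 and 2.00 - 6.50, which is the sense in which each dummy-coded predictor compares one level with the reference level.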
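The two contrast F values quoted from the one-factor ANOVA in the data set section can also be reproduced by hand. The sketch below is one way to do so, assuming NumPy and assuming the contrast weights (1, -0.5, -0.5) and (0, 1, -1); the source names the comparisons but does not list the weights, so these are a plausible choice rather than the author's stated values. Rescaling the weights does not change F. The helper contrast_f is hypothetical and written only for this illustration.

```python
import numpy as np

groups = [
    np.array([7, 7, 6, 6], dtype=float),     # no therapy
    np.array([2, 1, 1, 2, 2], dtype=float),  # smiling therapy
    np.array([3, 1, 2], dtype=float),        # exercise therapy
]

n = np.array([len(g) for g in groups])
means = np.array([g.mean() for g in groups])

# Pooled within-group variance, used as the error term for every contrast.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_within = int(n.sum()) - len(groups)
ms_within = ss_within / df_within


def contrast_f(weights):
    """F statistic (1, df_within) for a contrast among the group means."""
    w = np.asarray(weights, dtype=float)
    estimate = w @ means                           # weighted sum of means
    ss_contrast = estimate ** 2 / (w ** 2 / n).sum()
    return ss_contrast / ms_within


# Assumed weights: no therapy versus the mean of the two therapies.
f_none_vs_therapies = contrast_f([1, -0.5, -0.5])
# Assumed weights: smiling therapy versus exercise therapy.
f_smiling_vs_exercise = contrast_f([0, 1, -1])

print("F(1, %d) = %.2f" % (df_within, f_none_vs_therapies))
print("F(1, %d) = %.2f" % (df_within, f_smiling_vs_exercise))
```

Both values are tested against the same pooled error term with 9 degrees of freedom, which is why each contrast is reported as F(1, 9).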