Lecture 7: Multiple Linear Regression Interpretation with different types of predictorsInterpreting regression coefficientsBinary covariatesSlide 4How is MEDSCHL related to LOS?How is MEDSCHOOL related to LOS?R codeScatterplot? Residual Plot?Fitted Values“Linear” regression?MLR example: Add infection risk to our modelWhat about more than two categories?LOS ~ REGIONInterpretationhypothesis tests?RInterpretingMake REGION=4 the referenceR code: recoding so last category is referenceQuite a few differences:But the “model” is the sameDiagnosticsSlide 23Next type: polynomialsScatterplotFitting the modelHow does it fit?Another approach to the same dataSmoothingSlide 30Inference?Implementing a splineHow to interpret?RInterpreting the outputWhy subtract the 250 in defining nurse.star?Why a spline vs. the quadratic?Next timeLecture 7:Multiple Linear RegressionInterpretation with different types of predictorsBMTRY 701Biostatistical Methods IIInterpreting regression coefficientsSo far, we’ve considered continuous covariatesCovariates can take other forms:•binary•nominal categorical•quadratics (or other transforms)•interactionsInterpretations may vary depending on the nature of your covariateBinary covariatesConsidered ‘qualitative’The ordering of numeric assignments does not matterExamples: MEDSCHL: 1 = yes; 2 = noMore popular examples:•gender•mutation•pre vs. post-menopausal•Two age categoriesHow is MEDSCHL related to LOS?How to interpret β1?Coding of variables:•2 vs. 1•i prefer 1 vs. 0•difference? the intercept.Let’s make a new variable: •MS = 1 if MEDSCHL = 1 (yes)•MS = 0 if MEDSCHL = 2 (no)iiieMEDSCHLLOS 10How is MEDSCHOOL related to LOS?What does β1 mean?Same, yet different interpretation as continuousWhat if we had used old coding?101001010101]1|[0]0|[][iiiiiieiiMSLOSEMSLOSEMSLOSEeMSLOS101010101022]2|[1]1|[iiiieiiMEDSCHLLOSEMEDSCHLLOSEeMEDSCHLLOSR code> data$ms <- ifelse(data$MEDSCHL==2,0,data$MEDSCHL)> table(data$ms, data$MEDSCHL) 1 2 0 0 96 1 17 0> > reg <- lm(LOS ~ ms, data=data)> summary(reg)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.4105 0.1871 50.290 < 2e-16 ***ms 1.5807 0.4824 3.276 0.00140 ** ---Residual standard error: 1.833 on 111 degrees of freedomMultiple R-squared: 0.08818, Adjusted R-squared: 0.07997 F-statistic: 10.73 on 1 and 111 DF, p-value: 0.001404Scatterplot? Residual Plot?0.0 0.2 0.4 0.6 0.8 1.0-2 0 2 4 6 8 10data$msresres <- reg$residualsplot(data$ms, res)abline(h=0)0.0 0.2 0.4 0.6 0.8 1.08 10 12 14 16 18 20data$msdata$LOSFitted ValuesOnly two fitted values:Diagnostic plots are not as informativeExtrapolation and Interpolation are meaningless!We can estimate LOS for MS=0.5•LOS = 9.41 + 1.58*0.5 = 10.20•Try to interpret the result…99.1058.141.9]1|ˆ[41.9]0|ˆ[100MSYMSY“Linear” regression?But, what about ‘linear’ assumption?•we still need to adhere to the model assumptions•recall that they relate to the residuals primarily•residuals are independent and identically distributed:The model is still linear in the parameters!),0(~2NiMLR example: Add infection risk to our model> reg <- lm(LOS ~ ms + INFRISK, data=data)> summary(reg)Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.4547 0.5146 12.542 <2e-16 ***ms 0.9717 0.4316 2.251 0.0263 * INFRISK 0.6998 0.1156 6.054 2e-08 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.595 on 110 degrees of freedomMultiple R-squared: 0.3161, Adjusted R-squared: 0.3036 F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10 How does interpretation change?What about more than two categories?We looked briefly at region a few lectures back.How to interpret?You need to define a reference categoryFor med school:•reference was ms=0•almost ‘subconscious’ with only two categoriesWith >2 categories, need to be careful of interpretationLOS ~ REGIONNote how ‘indicator’ or ‘dummy’ variable is defined:•I(condition) = 1 if condition is true•I(condition) = 0 if condition is falseiiiiieRIRIRILOS )4()3()2(3210Interpretationβ0 = β1 =β2 =β3 =3020100]4|[]3|[]2|[]1|[iiiiiiiiRLOSERLOSERLOSERLOSEiiiiieRIRIRILOS )4()3()2(3210hypothesis tests?Ho: β1 = 0Ha: β1 ≠ 0What does that test (in words)?Ho: β2 = 0Ha: β2 ≠ 0What does that test (in words)?What if we want to test region, in general?One of our next topics!R > reg <- lm(LOS ~ factor(REGION), data=data)> summary(reg)Call:lm(formula = LOS ~ factor(REGION), data = data)Residuals: Min 1Q Median 3Q Max -3.05893 -1.03135 -0.02344 0.68107 8.47107 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.0889 0.3165 35.040 < 2e-16 ***factor(REGION)2 -1.4055 0.4333 -3.243 0.00157 ** factor(REGION)3 -1.8976 0.4194 -4.524 1.55e-05 ***factor(REGION)4 -2.9752 0.5248 -5.669 1.19e-07 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.675 on 109 degrees of freedomMultiple R-squared: 0.2531, Adjusted R-squared: 0.2325 F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07InterpretingIs mean LOS different in region 2 vs. 1?What about region 3 vs. 1 and 4 vs. 1?What about region 4 vs. 3?How to test that?Two options?•recode the data so that 3 or 4 is the reference•use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficientsFor now…we’ll focus on the first.Make REGION=4 the referenceOur model then changes:iiiiieRIRIRILOS )1()2()3(32100102030]4|[]3|[]2|[]1|[iiiiiiiiRLOSERLOSERLOSERLOSER code: recoding so last category is reference> data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))) )> reg <- lm(LOS ~ rev.region, data=data)>
View Full Document