Outline

1. Logistic regression: fitting the model
   Components of generalized linear models
   Logistic regression
   Case study: runoff data
   Case study: baby food
2. Logistic regression: inference
   Model fit and model diagnostics
   Comparing models
   Sparse data and the separation problem

Modeling non-normal data

In all of the linear models we have seen so far, the response variable has been modeled with a normal distribution:

(response) = (fixed parameters) + (normal error)

For many data sets, this model is inadequate.
Ex: if the response variable is categorical with two possible responses, it makes no sense to model the outcome as normal.
Ex: if the response is always a small positive integer, its distribution is also not well described by a normal distribution.
Generalized linear models (GLMs) are an extension of linear models to model non-normal response variables.
Logistic regression is for binary response variables.

The link function

Standard linear model:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + e_i, \qquad e_i \sim N(0, \sigma^2)$$

The mean, or expected value, of the response is

$$\mathbb{E}(y_i) = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$

We will use the notation $\eta_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$ to represent the linear combination of explanatory variables.
In a standard linear model, $\mathbb{E}(y_i) = \eta_i$.
In a GLM, there is a link function $g$ between $\eta$ and the mean of the response variable:

$$g(\mathbb{E}(y_i)) = \eta_i$$

For standard linear models, the link function is the identity function, $g(y_i) = y_i$.

It can be easier to consider the inverse of the link function:

$$\mathbb{E}(y_i) = g^{-1}(\eta_i)$$

When the response variable is binary (with values coded as 0 or 1), the mean is simply $\mathbb{E}y = \mathbb{P}(y = 1)$.
A useful function for this case is

$$\mathbb{E}y = \mathbb{P}(y = 1) = g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}$$

$\eta$ can take any value; the mean is always between 0 and 1.
The corresponding link function is called the logit function:

$$g(p) = \log\frac{p}{1 - p} = \log\frac{\mathbb{P}(Y = 1)}{\mathbb{P}(Y = 0)}$$

It is the log of the odds.
Regression under this model is called logistic regression.

Deviance

In standard linear models, we estimate the parameters by minimizing the sum of the squared residuals. This is equivalent to finding the parameters that maximize the likelihood.
In a GLM, we also fit parameters by maximizing the likelihood.
The deviance is negative two times the maximum log-likelihood, up to an additive constant.
Estimation is equivalent to finding parameter values that minimize the deviance.

Logistic regression

Logistic regression is a natural choice when the response is categorical with two possible outcomes.
Pick one outcome to be a "success" or "yes", where $y = 1$.
We desire a model to estimate the probability of success as a function of the explanatory variables.
Using the inverse logit function, the probability of success has the form

$$\mathbb{P}(y = 1) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}$$

Equivalent formulas:

$$\frac{\mathbb{P}(y = 1)}{\mathbb{P}(y = 0)} = e^{\eta}, \qquad \log\frac{\mathbb{P}(Y = 1)}{\mathbb{P}(Y = 0)} = \eta$$

We estimate the parameters so that this probability is high for cases where $y = 1$ and low for cases where $y = 0$.

Anesthesia example

In surgery, it is desirable to give enough anesthetic so that patients do not move when an incision is made.
It is also desirable not to use much more anesthetic than necessary.
In an experiment, patients are given different concentrations of anesthetic.
Response: whether or not they move at the time of incision, 15 minutes after receiving the drug.

Anesthesia data

    Concentration   Move   No move   Total   Proportion (no move)
    0.8                6         1       7   0.14
    1.0                4         1       5   0.20
    1.2                2         4       6   0.67
    1.4                2         4       6   0.67
    1.6                0         4       4   1.00
    2.5                0         2       2   1.00

Analyze in R with glm() twice: once using the raw data (0's and 1's), and once using summarized counts (the number not moving out of the total at each concentration).
This extends chi-square tests.

Binomial distribution

Logistic regression is related to the binomial distribution.
If there are several observations with the same explanatory variable values, then the individual responses can be added up, and the sum has a binomial distribution.
Recall: the binomial distribution has parameters $n$ and $p$, mean $\mu = np$, and variance $\sigma^2 = np(1 - p)$.
The probability distribution is

$$\mathbb{P}(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$

Logistic regression is in the binomial family of GLMs.

Logistic regression in R on raw data

    > dat = read.table("anesthetic.txt", header = T)
    > str(dat)
    'data.frame':   30 obs. of  3 variables:
     $ movement: Factor w/ 2 levels "move","noMove": 2 1 2 1 1 ...
     $ conc    : num  1 1.2 1.4 1.4 1.2 2.5 1.6 0.8 1.6 1.4 ...
     $ nomove  : int  1 0 1 0 0 1 1 0 1 0 ...
    > dat$movement
     [1] noMove move   noMove move   move   ...
    [21] noMove move   noMove move   noMove ...
    Levels: move noMove
    > fit.raw = glm(movement ~ conc, data = dat, family = binomial)
    > summary(fit.raw)
    glm(formula = nomove ~ conc, family = binomial, data = dat)

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -6.469      2.418  -2.675  0.00748
    conc           5.567      2.044   2.724  0.00645

        Null deviance: 41.455  on 29  degrees of freedom
    Residual deviance: 27.754  on 28  degrees of freedom
    AIC: 31.754

Note: the call printed by summary() uses the 0/1 variable nomove, while the fit above uses the factor movement. The two are equivalent here: for a factor response, glm() treats the first level (move) as failure and the other level (noMove) as success.

Fitted model

$$\mathbb{P}(\text{No move}) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} \quad \text{with} \quad \eta = -6.469 + 5.567 \times \text{concentration}$$

We can get predictions at the link level, $\eta_i$, and at the response level, $\hat{y}$ or $\mathbb{E}Y = \mathbb{P}(Y = 1)$:

    > predict(fit.raw, type = "link")
         1      2      3      4      5      6  ...     28     29     30
     -0.90   0.21   1.32   1.32   0.21  7.448  ...   0.21  -0.90   0.21
    > predict(fit.raw, type = "response")
         1      2      3      4      5      6  ...     28     29     30
      0.29   0.55   0.79   0.79   0.55  0.999  ...   0.55   0.29   0.55

Plot of the logit curve

    > layout(matrix(1:2, 2, 1))
    > my.etas = seq(-8, 8, by = .01)
    > my.prob = 1/(1 + exp(-my.etas))
    > plot(my.etas, my.prob, type = "l", bty = "n",
    +      xlab = "linear predictor: log-odds eta",
    +      ylab = "probability of success")
    > abline(h = 0); abline(h = 1)
    > lines(c(-10, 0), c(.5, .5), lty = 2)
    > lines(c(0, 0), c(0, .5), lty = 2)
    > my.conc = seq(0, 2.5, by = .05)
    > my.etas = -6.469 + 5.567 * my.conc
    > my.prob = 1/(1 + exp(-my.etas))
    > plot(my.conc, my.prob, type = "l", bty = "n", adj = 1,
    +      xlab = "", ylab = "prob. no movement")
    > mtext("concentration", side = 1, line = 0.4)
    > mtext("eta", side = 1, line = 2.4)
    > mtext("-6.5\n(intercept)", side = 1, at = 0, line = 4)
    > mtext("-0.9\n= -6.5 + 5.6", side = 1, at = 1, line = 4)
    > conc.5 = (0 - (-6.469))/5.567
    > mtext("0", side = 1, at = conc.5, line = 3)
    > mtext("4.7\n= -6.5 + 2 * 5.6", side = 1, at = 2, line = 4)
    > lines(c(-1, conc.5), c(.5, .5), lty = 2)
    > lines(c(conc.5, conc.5), c(0, .5), lty = 2)

Plot of movement probability versus concentration

    > plot(movement ~ conc, data = dat)
    > plot(movement ~ as.factor(conc), data = dat)
    > plot …
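The anesthesia example mentions a second analysis using summarized counts. A minimal sketch of what that might look like is given here, assuming the counts in the anesthesia data table are entered by hand; the names `counts`, `fit.counts`, `new.conc`, and `conc.half` are illustrative and are not taken from the slides. It also computes the linear predictor and the inverse logit by hand, so they can be compared with predict().

```r
# Summarized anesthesia counts, transcribed from the anesthesia data table.
counts <- data.frame(
  conc   = c(0.8, 1.0, 1.2, 1.4, 1.6, 2.5),
  move   = c(6, 4, 2, 2, 0, 0),
  nomove = c(1, 1, 4, 4, 4, 2)
)

# Two-column response: (successes, failures) = (no move, move).
fit.counts <- glm(cbind(nomove, move) ~ conc, data = counts, family = binomial)
beta <- unname(coef(fit.counts))

# Linear predictor (log-odds) and inverse logit, computed by hand.
new.conc <- c(1.0, 1.2, 1.4, 2.5)
eta  <- beta[1] + beta[2] * new.conc
prob <- 1 / (1 + exp(-eta))

# The same probabilities from predict() on the response scale.
prob.predict <- unname(predict(fit.counts,
                               newdata = data.frame(conc = new.conc),
                               type = "response"))

# Concentration at which the estimated P(no move) is 0.5, i.e. eta = 0.
conc.half <- -beta[1] / beta[2]

print(round(beta, 3))
print(round(cbind(conc = new.conc, eta = eta, prob = prob), 3))
print(round(conc.half, 3))
```

The coefficient estimates from the grouped fit should agree with those from the raw 0/1 fit, since both maximize the same likelihood up to a constant. The deviances and degrees of freedom are not comparable between the two fits, because the grouped data have only 6 binomial observations instead of 30 binary ones.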