Logistic regression (with R)

Christopher Manning
4 November 2007

1 Theory

We can transform the output of a linear regression to be suitable for probabilities by using a logit link function on the lhs as follows:

    \mathrm{logit}\, p = \log o = \log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k    (1)

The odds can vary on a scale of (0, \infty), so the log odds can vary on the scale of (-\infty, \infty) -- precisely what we get from the rhs of the linear model. For a real-valued explanatory variable x_i, the intuition here is that a unit additive change in the value of the variable should change the odds by a constant multiplicative amount.

Exponentiating, this is equivalent to:[1]

    e^{\mathrm{logit}\, p} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}    (2)

    o = \frac{p}{1-p} = e^{\beta_0} e^{\beta_1 x_1} e^{\beta_2 x_2} \cdots e^{\beta_k x_k}    (3)

The inverse of the logit function is the logistic function. If \mathrm{logit}(\pi) = z, then

    \pi = \frac{e^z}{1 + e^z}

The logistic function will map any value of the right hand side (z) to a proportion value between 0 and 1, as shown in figure 1.

Note a common case with categorical data: if our explanatory variables x_i are all binary, then for the ones that are false (0), we get e^0 = 1 and the term disappears. Similarly, if x_i = 1, e^{\beta_i x_i} = e^{\beta_i}. So we are left with terms for only the x_i that are true (1). For instance, if x_3, x_4, x_7 = 1 only, we have:

    \mathrm{logit}\, p = \beta_0 + \beta_3 + \beta_4 + \beta_7    (4)

    o = e^{\beta_0} e^{\beta_3} e^{\beta_4} e^{\beta_7}    (5)

The intuition here is that if I know that a certain fact is true of a data point, then that will produce a constant change in the odds of the outcome ("If he's European, that doubles the odds that he smokes").

Let L = L(D; B) be the likelihood of the data D given the model, where B = {\beta_0, ..., \beta_k} are the parameters of the model. The parameters are estimated by the principle of maximum likelihood. Technical point: there is no error term in a logistic regression, unlike in linear regressions.

[1] Note that we can convert freely between a probability p and odds o for an event versus its complement:

    o = \frac{p}{1-p}    p = \frac{o}{o+1}

[Figure 1: The logistic function. The curve maps z in (-6, 6) on the horizontal axis to values in (0.0, 1.0).]

2 Basic R logistic regression models

We will illustrate with the Cedegren dataset on the website.

cedegren <- read.table("cedegren.txt", header=T)

You need to create a two-column matrix of success/failure counts for your response variable. You cannot just use percentages. (You can give percentages, but then you must weight them by a count of successes + failures.)

attach(cedegren)
ced.del <- cbind(sDel, sNoDel)

Make the logistic regression model. The shorter second form is equivalent to the first, but don't omit specifying the family.

ced.logr <- glm(ced.del ~ cat + follows + factor(class), family=binomial("logit"))
ced.logr <- glm(ced.del ~ cat + follows + factor(class), family=binomial)
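As an aside, the parenthetical above about weighting percentages can be made concrete. The following is a minimal sketch, not from the original handout; it assumes the attached sDel and sNoDel columns, and the names n.tot, p.del, and ced.logr.w are introduced here purely for illustration. Fitting the observed proportions with the cell totals as weights gives the same estimates as the two-column count matrix:

n.tot <- sDel + sNoDel                 # successes + failures in each cell
p.del <- sDel / n.tot                  # observed proportion of deletions
ced.logr.w <- glm(p.del ~ cat + follows + factor(class),
                  family=binomial, weights=n.tot)
# coef(ced.logr.w) matches coef(ced.logr)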
The output in more and less detail:

> ced.logr
Call:  glm(formula = ced.del ~ cat + follows + factor(class), family = binomial("logit"))

Coefficients:
   (Intercept)            catd            catm            catn            catv
       -1.3183         -0.1693          0.1786          0.6667         -0.7675
      followsP        followsV  factor(class)2  factor(class)3  factor(class)4
        0.9525          0.5341          1.2704          1.0480          1.3742

Degrees of Freedom: 51 Total (i.e. Null);  42 Residual
Null Deviance:      958.7
Residual Deviance: 198.6        AIC: 446.1

> summary(ced.logr)
Call:
glm(formula = ced.del ~ cat + follows + factor(class), family = binomial("logit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.24384  -1.34325   0.04954   1.01488   6.40094

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.31827    0.12221 -10.787  < 2e-16
catd            -0.16931    0.10032  -1.688 0.091459
catm             0.17858    0.08952   1.995 0.046053
catn             0.66672    0.09651   6.908 4.91e-12
catv            -0.76754    0.21844  -3.514 0.000442
followsP         0.95255    0.07400  12.872  < 2e-16
followsV         0.53408    0.05660   9.436  < 2e-16
factor(class)2   1.27045    0.10320  12.310  < 2e-16
factor(class)3   1.04805    0.10355  10.122  < 2e-16
factor(class)4   1.37425    0.10155  13.532  < 2e-16

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 958.66  on 51  degrees of freedom
Residual deviance: 198.63  on 42  degrees of freedom
AIC: 446.10

Number of Fisher Scoring iterations: 4

Residual deviance is the difference in G² = −2 log L between a maximal model that has a separate parameter for each cell in the model and the built model. Changes in the deviance (the difference in the quantity −2 log L) for two models which can be nested in a reduction will be approximately χ²-distributed with dof equal to the change in the number of estimated parameters. Thus the difference in deviances can be tested against the χ² distribution for significance. The same concerns about this approximation being valid only for reasonably sized expected counts (as with contingency tables and multinomials in Suppes (1970)) still apply here, but we (and most people) ignore this caution and use the statistic as a rough indicator when exploring to find good models.

We're usually mainly interested in the relative goodness of models, but nevertheless, the high residual deviance shows that the model cannot be accepted to have been likely to generate the data (pchisq(198.63, 42) ≈ 1). However, it certainly fits the data better than the null model (which means that a fixed mean probability of deletion is used for all cells): pchisq(958.66 − 198.63, 9) ≈ 1.

What can we see from the parameters of this model? catd and catm have different effects, but neither is very clearly significantly different from the effect of cata (the default value). All following environments seem distinctive. For class, all of classes 2-4 seem to have somewhat similar effects, and we might model class as a two-way distinction. It seems like we cannot profitably drop a whole factor, but we can test that with the anova function to give an analysis of deviance table, or the drop1 function to try dropping each factor:

> anova(ced.logr, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: ced.del

Terms added sequentially (first to last)

              Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                             51     958.66
cat            4   314.88        47     643.79  6.690e-67
follows        2   228.86        45     414.93  2.011e-50
factor(class)  3   216.30        42     198.63  1.266e-46

> drop1(ced.logr, test="Chisq")
Single term deletions

Model:
ced.del ~ cat + follows + factor(class)
              Df Deviance    AIC    LRT   Pr(Chi)
<none>             198.63 446.10
cat            4   368.76 608.23 170.13 < 2.2e-16
follows        2   424.53 668.00 225.91 < 2.2e-16
factor(class)  3   414.93 656.39 216.30 < 2.2e-16

The anova test tries adding the factors only in the order given in the model formula (left to right). If things are close, you should try rearranging the model formula ordering, or use drop1; but given the huge drops in deviance here, the ordering clearly makes no difference.
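To make the null-model comparison above mechanical, here is a minimal sketch, not from the original handout, of pulling the deviances out of the fitted object and computing the tail probability directly; dev.drop and df.drop are names introduced here for illustration:

dev.drop <- ced.logr$null.deviance - ced.logr$deviance   # 958.66 - 198.63 = 760.03
df.drop  <- ced.logr$df.null - ced.logr$df.residual      # 51 - 42 = 9
pchisq(dev.drop, df.drop, lower.tail=FALSE)              # upper tail ~ 0: far better than null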
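The suggestion that class might be modeled as a two-way distinction can also be tested directly. A sketch under the assumption that class is coded 1-4 in cedegren.txt; class2way and ced.logr2 are hypothetical names introduced here:

class2way <- factor(ifelse(class == 1, "class1", "class234"))  # collapse classes 2-4
ced.logr2 <- glm(ced.del ~ cat + follows + class2way, family=binomial)
anova(ced.logr2, ced.logr, test="Chisq")   # small p would mean the 4-way split earns its keep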

