Lecture 14: Weighted Least Squares
Nancy R. Zhang
Statistics 191, Stanford University
March 5, 2008

Review - Binary responses data example

A clinic sent fliers to its clients to encourage everyone, but especially
older persons, to get a flu shot for protection against an expected flu
epidemic.
1. 150 clients randomly sampled.
2. Y: did they get a flu shot?
3. Predictor variables: age, health awareness.

Review - Binary responses model

Model: Y ∈ {0, 1},
    P(Y = 1 | X_1, ..., X_p) = g^{-1}(β_1 X_1 + β_2 X_2 + ... + β_p X_p),
where
    g(π) = log[π / (1 − π)].
The inverse g^{-1} is
    g^{-1}(z) = e^z / (1 + e^z).
We have no choice but to accept non-constant variance:
    Var(Y) = π(X)[1 − π(X)].

Review - Model interpretation

An intuitive quantity for assessing probabilities is the odds:
    odds = P(Y = 1 | X) / P(Y = 0 | X).
In the logistic regression model,
    log(odds) = βX.
The parameter β is the contribution of a unit increase in X to the
increase (or decrease) in the log-odds. For example, if X were binary as
well,
    log[odds(X = 1) / odds(X = 0)] = β.

Review - Model fitting

Fitting can be done by Newton-Raphson:
1. Let u = (∂l(β)/∂β_i)_{i=1,...,p} be the gradient vector.
2. Let H be the Hessian matrix, h_{ij} = ∂²l(β)/∂β_i ∂β_j.
3. Start with an initial β^(0), then iterate
       β^(t+1) = β^(t) − (H^(t))^{-1} u^(t).
The idea is, at each iteration t, to approximate l(β) locally by a
quadratic,
    l(β) ≈ l(β^(t)) + u^(t)′(β − β^(t)) + (1/2)(β − β^(t))′ H^(t) (β − β^(t)),
and solve ∂l(β)/∂β ≈ u^(t) + H^(t)(β − β^(t)) = 0.
For the logistic regression model, this gives
    β^(t+1) = β^(t) + {X′ diag[π_i^(t)(1 − π_i^(t))] X}^{-1} X′(y − π^(t)).
This is equivalent to doing a weighted linear regression at each step.
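The Newton-Raphson update can be sketched in a few lines. A minimal
illustration with NumPy, using simulated data rather than the flu-shot
data from the example (the function name and the simulated coefficients
are illustrative, not from the lecture):

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logistic regression, implementing the update
    beta^(t+1) = beta^(t) + {X' diag[pi(1-pi)] X}^(-1) X'(y - pi)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))   # g^{-1}(X beta)
        w = pi * (1.0 - pi)                    # weights pi_i (1 - pi_i)
        # Solve (X' W X) delta = X'(y - pi) rather than forming the inverse.
        delta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - pi))
        beta = beta + delta
        if np.max(np.abs(delta)) < tol:        # converged
            break
    return beta

# Simulated data (hypothetical): intercept plus one predictor, with
# true coefficients (0.5, 1.0).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))
y = (rng.uniform(size=200) < p_true).astype(float)
beta_hat = fit_logistic_newton(X, y)
```

Each Newton step here is exactly a weighted least-squares fit with
weights π_i(1 − π_i), which is why the procedure is also known as
iteratively reweighted least squares.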
Inference for β

In the Gaussian case,
    β̂ = (X′X)^{-1} X′Y,    Y ~ N(Xβ, σ²I).
Since Gaussian vectors remain Gaussian under linear transforms,
    β̂ ~ N(β, (X′X)^{-1} σ²).
For logistic regression, β̂ is no longer linear in Y. However,
asymptotically (i.e. for n large) it is Gaussian. Its covariance can be
estimated by
    Ĉov(β̂) = (X′ diag[π̂_i(1 − π̂_i)] X)^{-1}.
From the square roots of the diagonal elements of this matrix you get
ŝe(β̂).

Wald tests for β

1. Confidence intervals for β:
       [β̂ − z_{α/2} ŝe(β̂), β̂ + z_{α/2} ŝe(β̂)].
2. Two-sided test of H_0: β = 0; reject if
       |β̂| / ŝe(β̂) > z_{α/2}.
3. Test of the constraint H_0: C_{j×p} β_{p×1} = h_{j×1}; reject if
       (Cβ̂ − h)′ (C Ĉov(β̂) C′)^{-1} (Cβ̂ − h)
   is larger than χ²_{j, 1−α}.

Assessment of model fit

In linear regression, we used the F-test:
    F = {[SSE(RM) − SSE(FM)] / Δdf} / {SSE(FM) / [n − df(FM)]},
    F ~ F_{Δdf, n−df(FM)}.
The analogue of SSE for non-linear models is the deviance:
    Deviance(β̂) = 2[l(β̃, Y) − l(β̂, Y)],
where l(·, Y) is the log-likelihood and β̃ is the fit of the data under
the saturated model (n parameters, one per observation).

Assessment of model fit

For the Gaussian model,
    Deviance(β̂) = Σ_{i=1}^n (y_i − ŷ_i)².
For the logistic model,
    Deviance(β̂) = 2 Σ_{i=1}^n [Y_i log(Y_i / π̂_i) + (1 − Y_i) log((1 − Y_i) / (1 − π̂_i))]
                 = 2 Σ observed × log(observed / fitted).
Convention: 0 × log 0 = 0.

Nested Chi-squared tests of model fit

As with SSE, the greater the deviance, the poorer the fit. If the
reduced model (RM) were true, then
    Deviance(RM) − Deviance(FM) → χ²_{df(FM) − df(RM)}.
Thus, reject RM at asymptotic level α if
    Deviance(RM) − Deviance(FM) > χ²_{df(FM) − df(RM), 1−α}.
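The logistic deviance and the nested χ² test can be computed directly
from the formula above. A hedged sketch in plain Python: the responses
and the fitted probabilities for the "reduced" and "full" models below
are made-up numbers, and 3.841 is the 0.95 quantile of χ²_1:

```python
import math

def logistic_deviance(y, pi_hat):
    """2 * sum of observed * log(observed / fitted), using the
    convention 0 * log 0 = 0 (each 0/1 response keeps one term)."""
    dev = 0.0
    for yi, pi in zip(y, pi_hat):
        if yi == 1:
            dev += 2.0 * math.log(1.0 / pi)          # Y log(Y / pi-hat)
        else:
            dev += 2.0 * math.log(1.0 / (1.0 - pi))  # (1-Y) log((1-Y)/(1-pi-hat))
    return dev

# Hypothetical fitted probabilities for a reduced model (intercept only)
# and a full model with one extra parameter, so df(FM) - df(RM) = 1.
y     = [1, 0, 1, 1, 0, 0, 1, 0]
pi_rm = [0.5] * 8
pi_fm = [0.8, 0.2, 0.7, 0.9, 0.3, 0.1, 0.6, 0.4]

dev_diff = logistic_deviance(y, pi_rm) - logistic_deviance(y, pi_fm)
# Reject RM at asymptotic level 0.05 if the drop in deviance exceeds
# the chi^2_{1, 0.95} critical value, 3.841.
reject = dev_diff > 3.841
```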
Model diagnosis

In linear regression, the standardized residuals were used to diagnose
model fit:
    r_i = y_i − ŷ_i,    r*_i = r_i / σ̂_{r_i} = r_i / [σ̂ √(1 − p_ii)].
The analogous quantity here is the Pearson residual,
    r_i = (y_i − π̂_i) / √(π̂_i(1 − π̂_i)),    r*_i = r_i / √(1 − ĥ_ii),
where the ĥ_ii are the diagonals of
    Hat = W^{1/2} X (X′WX)^{-1} X′ W^{1/2},    W = diag[π_i(1 − π_i)].

Model diagnosis

Another quantity you can use is the deviance residual. Recall
    Deviance(β̂) = 2 Σ_{i=1}^n [Y_i log(Y_i / π̂_i) + (1 − Y_i) log((1 − Y_i) / (1 − π̂_i))]
                 = 2 Σ observed × log(observed / fitted).
Let d_i be the contribution of data point i to this measure of mis-fit:
    d_i = Y_i log(Y_i / π̂_i) + (1 − Y_i) log((1 − Y_i) / (1 − π̂_i)).
Deviance residual:
    √d_i × sign(y_i − π̂_i).

Model Selection

1. Nested models: the deviance difference is asymptotically χ². Thus,
   reject RM at asymptotic level α if
       Deviance(RM) − Deviance(FM) > χ²_{df(FM) − df(RM), 1−α},
   where Deviance = 2[l(β̃, Y) − l(β̂, Y)].
2. Non-nested models:
   a. AIC: −2 loglik + 2p.
   b. C_p, via a local asymptotic approximation using Pearson's
      residuals:
          r_i = (y_i − π̂_i) / √(π̂_i(1 − π̂_i)),    C_p = Σ_i r_i² + 2p σ̂².
   c. BIC: −2 loglik + p log n.

Multinomial Data

Suppose the response Y belongs to one of K categories.
1. Designate one category (say the K-th) as the "base" category.
2. For k = 1, ..., K − 1,
       P(Y = k | X) = e^{Xβ_k} / (1 + Σ_{l=1}^{K−1} e^{Xβ_l}),
   where β_k = (β_{k1}, ..., β_{kp})′, and
       P(Y = K | X) = 1 / (1 + Σ_{l=1}^{K−1} e^{Xβ_l}).
3. There are p × (K − 1) parameters.
4. β_{ki}, for the k-th category and i-th predictor, is interpreted as
   the increase in log-odds relative to the base category per unit
   increase in X_i.

Count data

1. Men and women were asked whether they believed in the afterlife
   (1991 General Social Survey).
2. Results:
                 Y    N or U   Total
       M       435       147     582
       F       375       134     509
       Total   810       281    1091
3. Question: is belief in the afterlife independent of gender?

Contingency Tables

1. Model: Y_ij ~ Poisson(λ_ij).
2. H_0: independence,
   i.e. λ_ij = λ α_i β_j.
3. H_A: λ_ij arbitrary.
4. Pearson's χ² test:
       χ² = Σ_{i,j} (Y_ij − E_ij)² / E_ij ~ χ²_1   (under H_0).
5. Why 1 df? The independence model has 5 parameters (λ, two α's, two
   β's) and 2 constraints ⇒ 3 free parameters. The unrestricted model
   has 4 parameters, and 4 − 3 = 1.

Under independence,
    log E(Y_ij) = log λ_ij = log λ + log α_i + log β_j.
What about the variance? Because the data are Poisson,
    Var(Y_ij) = E(Y_ij) = λ_ij.
Thus, the variance scales with the mean. The log stabilizes the
variance. But unlike before, we are explicitly modeling the data as
Poisson counts.
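Pearson's χ² statistic for the afterlife table can be computed by hand;
a short sketch in Python, with the expected counts taken from the
independence fit, E_ij = (row total × column total) / grand total:

```python
# Pearson chi-squared test of independence for the 2x2 afterlife table.
table = [[435, 147],   # men:   yes, no-or-undecided
         [375, 134]]   # women: yes, no-or-undecided

row = [sum(r) for r in table]                        # 582, 509
col = [sum(r[j] for r in table) for j in range(2)]   # 810, 281
total = sum(row)                                     # 1091

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e_ij = row[i] * col[j] / total   # expected count under independence
        chi2 += (table[i][j] - e_ij) ** 2 / e_ij

# Under H0, chi2 ~ chi^2_1, whose 0.95 quantile is 3.841; a statistic
# this small gives no evidence against independence.
```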