Stat 544, Lecture 13

Fitting Generalized Linear Models

Last time, we introduced the elements of the GLIM:

• The response $y$, with distribution
$$f(y; \theta) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},$$
where $\theta$ is the canonical parameter, from which we showed that $E(y) = \mu = b'(\theta)$ and $\mathrm{Var}(y) = a(\phi)\, b''(\theta)$.

• The linear predictor
$$\eta = x^T \beta,$$
where $x$ is a vector of covariates and $\beta$ is to be estimated.

• The link function, which connects $\eta$ to $\mu$:
$$g(\mu) = \eta.$$

In this notation, the subscript $i$ has been suppressed.

In many cases, $a(\phi)$ will have the form
$$a(\phi) = \phi / w,$$
where $\phi$ is the dispersion parameter and $w$ is a known weight.

Example: normal response. Under the normal model $y \sim N(\mu, \sigma^2)$, the log-density is
$$\log f = -\frac{1}{2\sigma^2}(y - \mu)^2 - \frac{1}{2}\log(2\pi\sigma^2)
= \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2).$$
Therefore, the canonical parameter is $\theta = \mu$, and the remaining elements are $b(\theta) = \mu^2/2 = \theta^2/2$, $\phi = \sigma^2$, and $a(\phi) = \phi$.

In a heteroscedastic model $y \sim N(\mu, \sigma^2/w)$, where $w$ is a known weight, we would have $\phi = \sigma^2$ and $a(\phi) = \phi/w$.

Example: binomial response. If $y \sim n^{-1}\,\mathrm{Bin}(n, \pi)$, then the log-probability mass function is
$$\log f = \log n! - \log(ny)! - \log(n(1-y))! + ny \log \pi + n(1-y)\log(1-\pi)$$
$$= \frac{y \log\left(\frac{\pi}{1-\pi}\right) + \log(1-\pi)}{1/n} + c,$$
where $c$ does not involve $\pi$. Therefore, the canonical parameter is
$$\theta = \log\left(\frac{\pi}{1-\pi}\right),$$
the $b$-function is
$$b(\theta) = -\log(1-\pi) = \log\left(1 + e^\theta\right),$$
the dispersion parameter is $\phi = 1$, and $a(\phi) = \phi/n$.

Canonical link. We showed that the canonical parameter for $y \sim N(\mu, \sigma^2)$ is $\theta = \mu$, and the canonical parameter for $ny \sim \mathrm{Bin}(n, \pi)$ is $\theta = \mathrm{logit}(\pi)$. Notice that the most commonly used link function for a normal model is $\eta = \mu$, and the most commonly used link function for the binomial model is $\eta = \mathrm{logit}(\pi)$. This is no accident. When $\eta = \theta$, we say that the model has a canonical link. Canonical links have some nice properties, which we will discuss. It is often convenient to use a canonical link. But convenience does not imply that the data actually conform to it.
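The binomial calculation above can be checked numerically. The sketch below (with made-up values for $n$ and $\pi$; not part of the original notes) differentiates $b(\theta) = \log(1 + e^\theta)$ and confirms that $b'(\theta)$ recovers $\mu = \pi$ and that $a(\phi)\, b''(\theta)$ recovers $\mathrm{Var}(y) = \pi(1-\pi)/n$:

```python
import numpy as np

def b(theta):
    # b-function for y ~ n^{-1} Bin(n, pi): b(theta) = log(1 + e^theta)
    return np.log1p(np.exp(theta))

n, pi = 20, 0.3                      # illustrative values, not from the notes
theta = np.log(pi / (1 - pi))        # canonical parameter, logit(pi)
a_phi = 1.0 / n                      # a(phi) = phi/n with phi = 1

h = 1e-4
b1 = (b(theta + h) - b(theta - h)) / (2 * h)               # numerical b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2   # numerical b''(theta)

print(b1, pi)                         # b'(theta) should equal mu = pi
print(a_phi * b2, pi * (1 - pi) / n)  # a(phi) b''(theta) should equal Var(y)
```

The same check works for the normal case, where $b(\theta) = \theta^2/2$ gives $b'(\theta) = \theta = \mu$ and $b''(\theta) = 1$.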
It will be important to check the appropriateness of the link through diagnostics, whether or not the link is canonical.

Fitting generalized linear models via Fisher scoring. ML estimation for $\beta$ may be carried out via Fisher scoring,
$$\beta^{(t+1)} = \beta^{(t)} + \left[ -E\, l''(\beta^{(t)}) \right]^{-1} l'(\beta^{(t)}),$$
where $l$ is the loglikelihood function for the entire sample $y_1, \ldots, y_N$.

Temporarily changing the notation, we will now let $l$, $l'$ and $l''$ denote the contribution of a single observation $y_i = y$ to the loglikelihood and its derivatives. We do this for simplicity, understanding that the corresponding functions for the entire sample are obtained by summing the contributions over the sample units $i = 1, \ldots, N$.

Ignoring constants, the loglikelihood is
$$l(\theta; y) = \frac{y\theta - b(\theta)}{a(\phi)}.$$

The contribution of $y = y_i$ to the $j$th element of the score vector is
$$\frac{\partial l}{\partial \beta_j}
= \left( \frac{\partial l}{\partial \theta} \right)
  \left( \frac{\partial \theta}{\partial \mu} \right)
  \left( \frac{\partial \mu}{\partial \eta} \right)
  \left( \frac{\partial \eta}{\partial \beta_j} \right).$$

The first factor is
$$\frac{\partial l}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)} = \frac{y - \mu}{a(\phi)}.$$

Because $\mu = b'(\theta)$, the second factor is
$$\frac{\partial \theta}{\partial \mu} = \frac{1}{b''(\theta)} = \frac{1}{V(\mu)} = \frac{a(\phi)}{\mathrm{Var}(y)},$$
where $V(\mu) = b''(\theta)$ is the variance function that we discussed last time.

The third factor, $\partial \mu / \partial \eta$, will depend on the link function. The fourth factor is $\partial \eta / \partial \beta_j = x_{ij}$, where $x_{ij}$ is the $j$th element of the covariate vector $x_i = x$ for the $i$th observation.
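These four factors can be confirmed numerically. The sketch below (with made-up covariate and parameter values, not from the notes) forms the chain-rule product for a single Bernoulli observation ($n = 1$, logit link) and compares it against a numerical gradient of the loglikelihood:

```python
import numpy as np

# Numerical check of the four chain-rule factors for one Bernoulli observation
# (y ~ Bin(1, pi), canonical logit link; the values below are illustrative).
x = np.array([1.0, 0.5, -2.0])       # covariate vector x_i
beta = np.array([0.3, -0.7, 0.2])
y = 1.0

eta = x @ beta
mu = 1 / (1 + np.exp(-eta))          # inverse logit
a_phi = 1.0                          # a(phi) = phi/n = 1 for n = 1

# The four factors: dl/dtheta, dtheta/dmu, dmu/deta, deta/dbeta_j
dl_dtheta = (y - mu) / a_phi
dtheta_dmu = a_phi / (mu * (1 - mu))  # 1/V(mu), since Var(y) = mu(1-mu)
dmu_deta = mu * (1 - mu)              # logit link derivative
score = dl_dtheta * dtheta_dmu * dmu_deta * x   # dl/dbeta

# Compare against a numerical gradient of l(beta) = y*eta - log(1 + e^eta)
def l(b):
    e = x @ b
    return y * e - np.log1p(np.exp(e))

h = 1e-6
num = np.array([(l(beta + h * u) - l(beta - h * u)) / (2 * h) for u in np.eye(3)])
print(np.max(np.abs(score - num)))   # should be near zero
```

Note that for this canonical link the second and third factors cancel, so the product collapses to $(y - \mu)\, x$.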
Putting it all together, we have
$$\frac{\partial l}{\partial \beta_j} = \frac{y - \mu}{\mathrm{Var}(y)} \left( \frac{\partial \mu}{\partial \eta} \right) x_{ij}.$$

If we are using the canonical link $\eta = \theta$, then $\partial \mu / \partial \eta = \partial \mu / \partial \theta = b''(\theta)$, so the score becomes
$$\frac{\partial l}{\partial \beta_j} = \frac{y - \mu}{\mathrm{Var}(y)}\, b''(\theta)\, x_{ij} = \frac{y - \mu}{a(\phi)}\, x_{ij}.$$

To find the expected second derivatives, we can use the property
$$-E\left( \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} \right)
= E\left[ \left( \frac{\partial l}{\partial \beta_j} \right) \left( \frac{\partial l}{\partial \beta_k} \right) \right]
= E\left( \frac{y - \mu}{\mathrm{Var}(y)} \right)^2 \left( \frac{\partial \mu}{\partial \eta} \right)^2 x_{ij} x_{ik}
= \frac{1}{\mathrm{Var}(y)} \left( \frac{\partial \mu}{\partial \eta} \right)^2 x_{ij} x_{ik}.$$

With the canonical link, this becomes
$$E\left( \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} \right) = -\frac{b''(\theta)}{a(\phi)}\, x_{ij} x_{ik}.$$

But under the canonical link, the actual second derivative is
$$\frac{\partial^2 l}{\partial \beta_j \partial \beta_k}
= \frac{\partial}{\partial \beta_k} \left[ \left( \frac{\partial l}{\partial \theta} \right) \left( \frac{\partial \theta}{\partial \beta_j} \right) \right]
= \left( \frac{\partial l}{\partial \theta} \right) \left( \frac{\partial^2 \theta}{\partial \beta_j \partial \beta_k} \right)
+ \left( \frac{\partial \theta}{\partial \beta_j} \right) \left( \frac{\partial^2 l}{\partial \theta^2} \right) \left( \frac{\partial \theta}{\partial \beta_k} \right)
= 0 + \left( \frac{\partial^2 l}{\partial \theta^2} \right) x_{ij} x_{ik}.$$

Also, we saw last time that
$$\frac{\partial^2 l}{\partial \theta^2} = -\frac{b''(\theta)}{a(\phi)},$$
so with the canonical link, the expected second derivatives equal the actual second derivatives, and Fisher scoring is the same thing as Newton-Raphson. Under the canonical link, the second derivatives are constant with respect to the data $y$.

Putting it together. For an arbitrary link, we have just shown that
$$\frac{\partial l}{\partial \beta_j} = \frac{y - \mu}{\mathrm{Var}(y)} \left( \frac{\partial \mu}{\partial \eta} \right) x_{ij}, \qquad
-E\left( \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} \right) = \frac{1}{\mathrm{Var}(y)} \left( \frac{\partial \mu}{\partial \eta} \right)^2 x_{ij} x_{ik}.$$

It follows that the score vector for the entire data set $y_1, \ldots, y_N$ can be written as
$$\frac{\partial l}{\partial \beta} = X^T A (y - \mu),$$
where $X = (x_1, \ldots, x_N)^T$,
$$A = \mathrm{Diag}\left[ \,[\mathrm{Var}(y_i)]^{-1} \left( \frac{\partial \mu_i}{\partial \eta_i} \right) \right]
= \mathrm{Diag}\left[ \mathrm{Var}(y_i) \left( \frac{\partial \eta_i}{\partial \mu_i} \right) \right]^{-1},$$
and $y$ and $\mu$ now denote the entire vectors
$$y = (y_1, \ldots, y_N)^T, \qquad \mu = (\mu_1, \ldots, \mu_N)^T.$$

The expected Hessian matrix becomes
$$-E\left( \frac{\partial^2 l}{\partial \beta\, \partial \beta^T} \right) = X^T W X,$$
where
$$W = \mathrm{Diag}\left[ \,[\mathrm{Var}(y_i)]^{-1} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \right]
= \mathrm{Diag}\left[ \mathrm{Var}(y_i) \left( \frac{\partial \eta_i}{\partial \mu_i} \right)^2 \right]^{-1}.$$

An iteration of Fisher scoring is then
$$\beta^{(t+1)} = \beta^{(t)} + \left( X^T W X \right)^{-1} X^T A (y - \mu),$$
where $W$, $A$ and $\mu$ are calculated from $\beta^{(t)}$.

Iteratively reweighted least squares. Recall that a heteroscedastic normal model is fit by weighted least squares (WLS),
$$\hat\beta = \left( X^T W X \right)^{-1} X^T W y,$$
where $y$ is the response and $W$ is the diagonal matrix of weights; this is equivalent to OLS regression of $W^{1/2} y$ on $W^{1/2} X$.

We can arrange the step of Fisher scoring to make it resemble WLS. First, rewrite it as
$$\beta^{(t+1)} = \left( X^T W X \right)^{-1} \left[ X^T W X \beta^{(t)} + X^T A (y - \mu) \right].$$
Then note that $X\beta = (\eta_1, \ldots, \eta_N)^T = \eta$.
Also, note that $A$ and $W$ are related by
$$A = W \left( \frac{\partial \eta}{\partial \mu} \right),$$
where $\partial \eta / \partial \mu = \mathrm{Diag}(\partial \eta_i / \partial \mu_i)$. Therefore, we can write the Fisher scoring step as
$$\beta^{(t+1)} = \left( X^T W X \right)^{-1} X^T W z,$$
where
$$z = \eta + \left( \frac{\partial \eta}{\partial \mu} \right)(y - \mu) = (z_1, \ldots, z_N)^T,$$
with $z_i = \eta_i + (\partial \eta_i / \partial \mu_i)(y_i - \mu_i)$. In the GLIM literature, $z_i$ is often called the adjusted dependent variate or the working variate. Fisher scoring can therefore be regarded as iteratively reweighted least squares (IRWLS) carried out on a transformed version of the response variable.

At each cycle, we

• use the current estimate for $\beta$ to calculate a new working variate $z$ and a new set of weights $W$, and then

• regress $z$ on $X$ using weights $W$ to get the updated $\beta$.

Viewing Fisher scoring as IRWLS makes it easy to program.
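As a concrete sketch of these two steps, here is a minimal IRWLS fitter for Bernoulli data with the canonical logit link. The function name, simulated data, and tolerances are illustrative, not from the notes:

```python
import numpy as np

def irwls_logistic(X, y, tol=1e-8, max_iter=25):
    """Fit a Bernoulli GLIM with the logit link by IRWLS (a sketch).

    Each cycle forms the working variate z_i = eta_i + (d eta_i/d mu_i)(y_i - mu_i)
    and the weights W_i = [Var(y_i) (d eta_i/d mu_i)^2]^{-1}, then solves the
    weighted least-squares problem beta = (X^T W X)^{-1} X^T W z.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        # Logit link: d eta/d mu = 1/(mu(1-mu)) and Var(y_i) = mu_i(1-mu_i),
        # so the weights simplify to W_i = mu_i (1 - mu_i).
        w = mu * (1 - mu)
        z = eta + (y - mu) / w                    # working variate
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Demo on simulated data: at the MLE, the score X^T (y - mu) should vanish.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([-0.5, 1.0, -1.5])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)
beta_hat = irwls_logistic(X, y)
mu_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
print(np.max(np.abs(X.T @ (y - mu_hat))))       # near zero at convergence
```

Because the logit link is canonical, this iteration is identical to Newton-Raphson on the Bernoulli loglikelihood, as shown earlier.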