Chapter 5
Log-Linear Models for Contingency Tables

G. Rodríguez. Revised November, 2001; minor corrections August 2010, February 2014

In this chapter we study the application of Poisson regression models to the analysis of contingency tables. This is perhaps one of the most popular applications of log-linear models, and is based on the existence of a very close relationship between the multinomial and Poisson distributions.

5.1 Models for Two-dimensional Tables

We start by considering the simplest possible contingency table: a two-by-two table. However, the concepts to be introduced apply equally well to more general two-way tables where we study the joint distribution of two categorical variables.

5.1.1 The Heart Disease Data

Table 5.1 was taken from the Framingham longitudinal study of coronary heart disease (Cornfield, 1962; see also Fienberg, 1977). It shows 1329 patients cross-classified by the level of their serum cholesterol (below or above 260) and the presence or absence of heart disease.

There are various sampling schemes that could have led to these data, with consequences for the probability model one would use, the types of questions one would ask, and the analytic techniques that would be employed. Yet, all schemes lead to equivalent analyses. We now explore several approaches to the analysis of these data.

Table 5.1: Serum Cholesterol and Heart Disease

| Serum Cholesterol | Heart Disease: Present | Heart Disease: Absent | Total |
|-------------------|------------------------|-----------------------|-------|
| < 260             | 51                     | 992                   | 1043  |
| 260+              | 41                     | 245                   | 286   |
| Total             | 92                     | 1237                  | 1329  |

5.1.2 The Multinomial Model

Our first approach will assume that the data were collected by sampling 1329 patients who were then classified according to cholesterol and heart disease. We view these variables as two responses, and we are interested in their joint distribution.
In this approach the total sample size is assumed fixed, and all other quantities are considered random.

We will develop the random structure of the data in terms of the row and column variables, and then note what this implies for the counts themselves. Let $C$ denote serum cholesterol and $D$ denote heart disease, both discrete factors with two levels. More generally, we can imagine a row factor with $I$ levels indexed by $i$ and a column factor with $J$ levels indexed by $j$, forming an $I \times J$ table. In our example $I = J = 2$.

To describe the joint distribution of these two variables we let $\pi_{ij}$ denote the probability that an observation falls in row $i$ and column $j$ of the table. In our example, $\pi_{ij}$ is the probability that serum cholesterol $C$ takes the value $i$ and heart disease $D$ takes the value $j$. In symbols,

$$\pi_{ij} = \Pr\{C = i, D = j\}, \tag{5.1}$$

for $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$. These probabilities completely describe the joint distribution of the two variables.

We can also consider the marginal distribution of each variable. Let $\pi_{i.}$ denote the probability that the row variable takes the value $i$, and let $\pi_{.j}$ denote the probability that the column variable takes the value $j$. In our example $\pi_{i.}$ and $\pi_{.j}$ represent the marginal distributions of serum cholesterol and heart disease. In symbols,

$$\pi_{i.} = \Pr\{C = i\} \quad \text{and} \quad \pi_{.j} = \Pr\{D = j\}. \tag{5.2}$$

Note that we use a dot as a placeholder for the omitted subscript.

The main hypothesis of interest with two responses is whether they are independent. By definition, two variables are independent if (and only if) their joint distribution is the product of the marginals. Thus, we can write the hypothesis of independence as

$$H_0 : \pi_{ij} = \pi_{i.}\,\pi_{.j} \tag{5.3}$$

for all $i = 1, \ldots, I$ and $j = 1, \ldots, J$.
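The relationship between the joint probabilities, the margins, and the independence hypothesis in Equation 5.3 can be sketched in a few lines of Python. This is only an illustration: the function names `marginals` and `is_independent` are made up for this example, and the two probability tables are hypothetical values chosen to share the same margins, not estimates from Table 5.1.

```python
import math

def marginals(pi):
    """Return (row, column) marginal probabilities of a joint table pi[i][j]."""
    rows = [sum(r) for r in pi]            # pi_i. : sum over columns j
    cols = [sum(c) for c in zip(*pi)]      # pi_.j : sum over rows i
    return rows, cols

def is_independent(pi, tol=1e-12):
    """Check whether pi[i][j] equals pi_i. * pi_.j in every cell (Equation 5.3)."""
    rows, cols = marginals(pi)
    return all(
        math.isclose(pi[i][j], rows[i] * cols[j], abs_tol=tol)
        for i in range(len(rows))
        for j in range(len(cols))
    )

# Hypothetical joint distributions (assumed values, not taken from the data).
# Both have row margins (0.8, 0.2) and column margins (0.1, 0.9).
independent = [[0.08, 0.72], [0.02, 0.18]]   # each cell is the product of its margins
dependent = [[0.05, 0.75], [0.05, 0.15]]     # same margins, different joint

print(marginals(dependent))
print(is_independent(independent), is_independent(dependent))
```

The two tables have identical margins, which shows that the marginal distributions alone do not determine the joint distribution unless independence holds.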
The question now is how to estimate the parameters and how to test the hypothesis of independence.

The traditional approach to testing this hypothesis calculates expected counts under independence and compares observed and expected counts using Pearson's chi-squared statistic. We adopt a more formal approach that relies on maximum likelihood estimation and likelihood ratio tests. In order to implement this approach we consider the distribution of the counts in the table.

Suppose each of $n$ observations is classified independently in one of the $IJ$ cells in the table, and suppose the probability that an observation falls in the $(i, j)$-th cell is $\pi_{ij}$. Let $Y_{ij}$ denote a random variable representing the number of observations in row $i$ and column $j$ of the table, and let $y_{ij}$ denote its observed value. The joint distribution of the counts is then the multinomial distribution, with

$$\Pr\{\mathbf{Y} = \mathbf{y}\} = \frac{n!}{y_{11}!\, y_{12}!\, y_{21}!\, y_{22}!}\, \pi_{11}^{y_{11}}\, \pi_{12}^{y_{12}}\, \pi_{21}^{y_{21}}\, \pi_{22}^{y_{22}}, \tag{5.4}$$

where $\mathbf{Y}$ is a random vector collecting all four counts and $\mathbf{y}$ is a vector of observed values. The term to the right of the fraction represents the probability of obtaining $y_{11}$ observations in cell (1,1), $y_{12}$ in cell (1,2), and so on. The fraction itself is a combinatorial term representing the number of ways of obtaining $y_{11}$ observations in cell (1,1), $y_{12}$ in cell (1,2), and so on, out of a total of $n$. The multinomial distribution is a direct extension of the binomial distribution to more than two response categories. In the present case we have four categories, which happen to represent a two-by-two structure. In the special case of only two categories the multinomial distribution reduces to the familiar binomial.

Taking logs and ignoring the combinatorial term, which does not depend on the parameters, we obtain the multinomial log-likelihood function, which for a general $I \times J$ table has the form

$$\log L = \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log(\pi_{ij}). \tag{5.5}$$

To estimate the parameters we need to take derivatives of the log-likelihood function with respect to the probabilities, but in doing so we must take into account the fact that the probabilities add up to one over the entire table. This restriction may be imposed by adding a Lagrange multiplier, or more simply by writing the last probability as the complement of all others. In either case, we find the unrestricted maximum likelihood estimate to be the sample proportion:

$$\hat{\pi}_{ij} = \frac{y_{ij}}{n}.$$

Substituting these estimates into the log-likelihood function gives its unrestricted maximum.

Under the hypothesis of independence in Equation 5.3, the joint probabilities depend on the margins. Taking derivatives with respect to $\pi_{i.}$ and $\pi_{.j}$, and noting that these are also constrained to add up to one over the rows and columns, respectively, we find the m.l.e.'s

$$\hat{\pi}_{i.} = \frac{y_{i.}}{n} \quad \text{and} \quad \hat{\pi}_{.j} = \frac{y_{.j}}{n},$$

where $y_{i.} = \sum_j y_{ij}$ denotes the row totals and $y_{.j}$ denotes the column totals. Combining these estimates and multiplying by $n$ to obtain expected counts gives

$$\hat{\mu}_{ij} = \frac{y_{i.}\, y_{.j}}{n},$$

which is the familiar
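As a rough illustration of the estimates described in this section, the following Python sketch applies them to the counts in Table 5.1. It computes the sample proportions, the expected counts under independence, and the log-likelihood of Equation 5.5 (without the combinatorial term) at both sets of estimates. The variable names and the helper `log_lik` are choices made for this example, not part of the original notes.

```python
import math

# Observed counts from Table 5.1.
# Rows: serum cholesterol (< 260, 260+); columns: heart disease (present, absent).
y = [[51, 992], [41, 245]]

n = sum(sum(row) for row in y)                 # total sample size
row_totals = [sum(row) for row in y]           # y_i.
col_totals = [sum(col) for col in zip(*y)]     # y_.j

# Unrestricted m.l.e.: the sample proportions y_ij / n
pi_hat = [[yij / n for yij in row] for row in y]

# Expected counts under independence: y_i. * y_.j / n
mu_hat = [[ri * cj / n for cj in col_totals] for ri in row_totals]

def log_lik(counts, probs):
    """Multinomial log-likelihood of Equation 5.5, omitting the combinatorial term."""
    return sum(
        yij * math.log(pij)
        for y_row, p_row in zip(counts, probs)
        for yij, pij in zip(y_row, p_row)
    )

# Log-likelihood at the unrestricted estimates and at the independence estimates
ll_unrestricted = log_lik(y, pi_hat)
ll_independence = log_lik(y, [[m / n for m in row] for row in mu_hat])

print("expected counts:", mu_hat)
print("log-likelihoods:", ll_unrestricted, ll_independence)
```

The expected counts reproduce the observed row and column totals, and the log-likelihood at the unrestricted estimates is at least as large as the one obtained under independence, as it must be for a maximum over a larger parameter space.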
