Topic 10: Contingency Tables (Ch. 15)

1 Chi-Square Test

A contingency table is a tabular arrangement of nominal data from multiple populations.

1.1 2x2 tables

One way to analyze such data is the chi-square test. Suppose one takes a random sample of $n$ units and then categorizes the units on the basis of 2 or more categorical variables. For simplicity, let's consider first a 2x2 table, where there are two categorical variables, each with two possible outcomes.

As an example, consider a random sample of $n = 793$ people involved in bicycle accidents. The accident report specifies whether or not each person (1) was wearing a helmet and (2) suffered a head injury. Suppose 147 were wearing helmets and 646 were not. In the group with helmets, 17 (or $p_1 = 0.116$) had a head injury and 130 (or $1 - p_1 = 0.884$) did not, whereas in the group without helmets, 218 (or $p_2 = 0.337$) had a head injury and 428 (or $1 - p_2 = 0.663$) did not. The data, in the form of a 2x2 contingency table, are

| Head Injury | Helmet: Yes | Helmet: No | Total |
|---|---|---|---|
| Yes | 17 | 218 | 235 |
| No | 130 | 428 | 558 |
| Total | 147 | 646 | 793 |

(1)

Arranged as proportions, the data are

| Head Injury | Helmet: Yes | Helmet: No | Total |
|---|---|---|---|
| Yes | 0.116 | 0.337 | 0.296 |
| No | 0.884 | 0.663 | 0.704 |
| Total | 1.0 | 1.0 | 1.0 |

An obvious question is whether there is any association between the incidence of head injury and the use of helmets among those involved in bicycle accidents. The generic null hypothesis is

$H_0$: variable A is independent of variable B,

or, in the context of this problem,

$H_0$: the incidence of head injuries is independent of (or has no association with) the use of helmets.

One alternative way to state the equivalent hypothesis, as given in the text, is

$H_0$: $p_1 = p_2$ vs. $H_A$: $p_1 \neq p_2$,

where $p_1$ and $p_2$ are the proportions of head injuries for those with helmets and those without helmets, respectively. Note that if $p_1 = p_2$, there is no association between the two variables. Recall that we had a test for $p_1 = p_2$ in the previous chapter. A chi-square test gives us an alternative way to solve the problem, a method that will generalize to more than 2 categories for one or both of the variables.
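As a quick check on the proportions quoted for the bicycle helmet data, the following is a minimal Python sketch that recomputes them from the observed counts. The function names (`column_proportions`, `row_margin_proportions`) are illustrative choices, not part of the notes or the text.

```python
# Observed counts for the bicycle helmet example.
# Rows: head injury (yes, no). Columns: helmet (yes, no).
observed = [[17, 218],
            [130, 428]]


def column_proportions(table):
    """Divide each cell by its column total (proportions within each helmet group)."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / total for cell, total in zip(row, col_totals)]
            for row in table]


def row_margin_proportions(table):
    """Proportion of the whole sample falling in each row (ignoring columns)."""
    n = sum(sum(row) for row in table)
    return [sum(row) / n for row in table]


props = column_proportions(observed)
p1 = props[0][0]   # head-injury proportion among helmet wearers
p2 = props[0][1]   # head-injury proportion among non-wearers
overall = row_margin_proportions(observed)

print(round(p1, 3), round(p2, 3), round(overall[0], 3))
```

Rounded to three decimals, `p1`, `p2`, and `overall[0]` correspond to the head-injury proportions for helmet wearers, non-wearers, and the whole sample.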
To carry out the test, consider the following nomenclature. Let $O_{ij}$ denote the observed count in row $i$ and column $j$, $O_{i\cdot}$ the total in row $i$ for $i = 1, \ldots, r$, $O_{\cdot j}$ the total in column $j$ for $j = 1, \ldots, c$, and $n$ the grand total of all observations. This gives the table

| Row | Column 1 | Column 2 | $\cdots$ | Column $c$ | Total |
|---|---|---|---|---|---|
| 1 | $O_{11}$ | $O_{12}$ | $\cdots$ | $O_{1c}$ | $O_{1\cdot}$ |
| 2 | $O_{21}$ | $O_{22}$ | $\cdots$ | $O_{2c}$ | $O_{2\cdot}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ | $\vdots$ |
| $r$ | $O_{r1}$ | $O_{r2}$ | $\cdots$ | $O_{rc}$ | $O_{r\cdot}$ |
| Total | $O_{\cdot 1}$ | $O_{\cdot 2}$ | $\cdots$ | $O_{\cdot c}$ | $n$ |

(2)

To calculate the test statistic, one must first calculate the expected counts assuming $H_0$ is true, e.g., if the use of helmets has no association with the incidence of head injuries. If true, and hence $p_1 = p_2$, then the best estimate of the common injury rate is the total number of head injuries, or $O_{1\cdot} = O_{11} + O_{12}$, divided by the total sample size $n$. For these data, that is $O_{1\cdot}/n = 235/793 = 0.296$. Recall that $O_{\cdot 1} = 147$ people wore helmets. Therefore, assuming independence, the expected number of people wearing helmets that would sustain a head injury is $O_{\cdot 1}(O_{1\cdot}/n)$. Labeling the expected number in row 1 and column 1 as $E_{11}$, one has

$$E_{11} = \frac{O_{1\cdot} O_{\cdot 1}}{n}.$$

For these data, $E_{11} = 235(147)/793 = 43.6$. Note that $E_{21}$, which is the expected number of helmet users not sustaining a head injury, is, by similar reasoning, $E_{21} = 558(147)/793 = 103.4$. This could also be found by subtraction, i.e., $E_{21} = O_{\cdot 1} - E_{11}$.

In general, one can find the expected counts as

$$E_{ij} = \frac{O_{i\cdot} O_{\cdot j}}{n}. \qquad (3)$$

Using formula (3) for the data in (1), the table of expected counts is

| Head Injury | Helmet: Yes | Helmet: No | Total |
|---|---|---|---|
| Yes | 43.6 | 191.4 | 235 |
| No | 103.4 | 454.6 | 558 |
| Total | 147 | 646 | 793 |

(4)

Note that the column and row totals in (1) and (4) are the same, but in (4) the counts are redistributed to give expected values under $H_0$. The test statistic is

$$\chi^2_{(r-1)(c-1)} = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}. \qquad (5)$$

Under $H_0$ this is a chi-square statistic with $(r-1)(c-1)$ df, provided (1) all $E_{ij} \geq 1$ and (2) no more than 20% of the $E_{ij} < 5$. The $\chi^2$ distribution is illustrated in Figure 15.1. As an example, for the present data one has

$$\chi^2_1 = \frac{(17 - 43.6)^2}{43.6} + \frac{(130 - 103.4)^2}{103.4} + \frac{(218 - 191.4)^2}{191.4} + \frac{(428 - 454.6)^2}{454.6} = 28.3.$$

To determine whether this is large under $H_0$, one can find the rejection region (RR) in Table A.8. This is a one-sided test with critical value $\chi^2_{1,.05} = 3.84$.
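The expected-count formula (3) and the test statistic (5) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names (`expected_counts`, `chi_square_statistic`) are hypothetical, and the critical value 3.84 is simply the tabled value quoted above, not something the code derives.

```python
def expected_counts(table):
    """Expected counts under independence: E_ij = (row total i) * (column total j) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Return (statistic, df) for the chi-square test of independence, without correction."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Bicycle helmet data: rows = head injury (yes, no), columns = helmet (yes, no).
observed = [[17, 218],
            [130, 428]]

expected = expected_counts(observed)
stat, df = chi_square_statistic(observed)
critical_value = 3.84  # tabled chi-square critical value for 1 df at the 0.05 level

print([[round(e, 1) for e in row] for row in expected])
print(round(stat, 1), df, stat > critical_value)
```

Because the functions work from row and column totals only, the same sketch applies unchanged to any r x c table of counts.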
Therefore, for these data one would reject $H_0$ with $p < 0.001$. How would you interpret the result?

The hypothesis testing framework is:

1. $H_0$: variables are independent
2. $H_A$: variables have some association
3. TS: $\chi^2_{(r-1)(c-1)}$ in (5)
4. RR: $\chi^2 \geq \chi^2_{(r-1)(c-1),\alpha}$
5. Calculations: find $E_{ij}$ in (3) and substitute data into (5)

For the case of a 2x2 table with small $n$, the use of the continuity correction improves the approximation. The test statistic with the correction is

$$\chi^2_1 = \sum_{ij} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}.$$

Clearly it would always reduce the $\chi^2$ statistic. When used with the bicycle helmet data, the new value of the statistic is $\chi^2_1 = 27.3$.

Another test procedure for 2x2 tables is called Fisher's exact test. It is computationally intensive. Though used in many statistical software packages, we will not develop it.

1.2 r x c tables

Consider a table in which $r$ and/or $c$ exceeds 2. This is called an r x c table. The general table layout is given in (2). As an example, consider the text example. There are 575 death certificates, which are investigated and classified according to 2 variables. One variable is type of hospital, with outcomes A or B, denoting community and university, respectively. The other is death certificate accuracy, with 3 possible outcomes. The data are

| Hospital | Accurate | Incomplete | Needs Change | Total |
|---|---|---|---|---|
| A | 157 | 18 | 54 | 229 |
| B | 268 | 44 | 34 | 346 |
| Total | 425 | 62 | 88 | 575 |

The general hypothesis which one could test is

$H_0$: death certificate status is independent of hospital type.

This hypothesis is sometimes expressed in an equivalent but more technical way. Let $p_{ij}$ denote the proportion of certificates from hospital $i$ with certificate status $j$. This would give the table

| Hospital | Status 1 | Status 2 | Status 3 | Total |
|---|---|---|---|---|
| 1 | $p_{11}$ | $p_{12}$ | $p_{13}$ | 1.0 |
| 2 | $p_{21}$ | $p_{22}$ | $p_{23}$ | 1.0 |

The null hypothesis is that the proportions in each certificate status are the same for both hospitals, i.e.,

$H_0$: $p_{11} = p_{21}$, $p_{12} = p_{22}$, and $p_{13} = p_{23}$

$H_A$: the proportions in some …