PSU STAT 504 - Two Way Tables - D1034161


Course Stat 504- Analysis of Discrete Data

Pages 8


Stat 504, Lecture 5

Introduction to Two-Way Tables

Suppose that we collect data on two binary variables, $Y$ and $Z$. Binary means that these variables take two possible values, say 1 and 2. Suppose we collect values of $Y$ and $Z$ for $n$ sample units. The data then consist of $n$ pairs,

$$(y_1, z_1),\ (y_2, z_2),\ \ldots,\ (y_n, z_n).$$

We can summarize the data in a frequency table. Let $x_{ij}$ be the number of sample units having $Y = i$ and $Z = j$. Then $x = (x_{11}, x_{12}, x_{21}, x_{22})$ is a summary of all $n$ responses. We could display $x$ as a one-way table with four cells, but it is customary to display $x$ as a square table with two rows and two columns:

           Z = 1      Z = 2
  Y = 1    $x_{11}$   $x_{12}$
  Y = 2    $x_{21}$   $x_{22}$

Marginal totals. When a subscript in a cell count $x_{ij}$ is replaced by a plus sign (+), it will mean that we have taken the sum of the cell counts over that subscript. The row totals are

$$x_{1+} = x_{11} + x_{12}, \qquad x_{2+} = x_{21} + x_{22},$$

the column totals are

$$x_{+1} = x_{11} + x_{21}, \qquad x_{+2} = x_{12} + x_{22},$$

and the grand total is

$$x_{++} = x_{11} + x_{12} + x_{21} + x_{22} = n.$$

These quantities are often called marginal totals, because they are conveniently placed in the margins of the table, like this.

           Z = 1      Z = 2      total
  Y = 1    $x_{11}$   $x_{12}$   $x_{1+}$
  Y = 2    $x_{21}$   $x_{22}$   $x_{2+}$
  total    $x_{+1}$   $x_{+2}$   $x_{++}$

If the sample units are randomly sampled from a large population, then $x = (x_{11}, x_{12}, x_{21}, x_{22})$ will have a multinomial distribution with index $n = x_{++}$ and parameter vector

$$\pi = (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}),$$

where $\pi_{ij} = P(Y = i, Z = j)$.

The independence model

Given a 2 × 2 table, it is natural to ask how $Y$ and $Z$ are related. Suppose for the moment that there is no relationship between $Y$ and $Z$, i.e. that they are independent. Independence means that

$$\pi_{ij} = P(Y = i, Z = j) = P(Y = i)\,P(Z = j)$$

for $i, j = 1, 2$.
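The marginal totals defined above are simple sums over rows and columns of the table. The following is a minimal Python sketch of that bookkeeping; the function name `marginal_totals` and the cell counts are hypothetical, chosen only for illustration, and are not part of the lecture.

```python
from typing import List, Tuple


def marginal_totals(x: List[List[int]]) -> Tuple[List[int], List[int], int]:
    """Return (row totals x_{i+}, column totals x_{+j}, grand total x_{++})."""
    row_totals = [sum(row) for row in x]          # sum over the column subscript j
    col_totals = [sum(col) for col in zip(*x)]    # sum over the row subscript i
    grand_total = sum(row_totals)                 # x_{++} = n
    return row_totals, col_totals, grand_total


# Hypothetical 2 x 2 table: rows are Y = 1, 2 and columns are Z = 1, 2.
x = [[10, 20],
     [30, 40]]

rows, cols, n = marginal_totals(x)
print("row totals:   ", rows)
print("column totals:", cols)
print("grand total:  ", n)
```

The same function works for tables with more than two rows or columns, since it only sums along each dimension.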
Let $P(Y = 1) = \alpha$ and $P(Z = 1) = \beta$, so that $P(Y = 2) = 1 - \alpha$ and $P(Z = 2) = 1 - \beta$. Under independence, we have

$$\pi_{11} = P(Y = 1)\,P(Z = 1) = \alpha\beta, \tag{1}$$
$$\pi_{12} = P(Y = 1)\,P(Z = 2) = \alpha(1 - \beta), \tag{2}$$
$$\pi_{21} = P(Y = 2)\,P(Z = 1) = (1 - \alpha)\beta, \tag{3}$$
$$\pi_{22} = P(Y = 2)\,P(Z = 2) = (1 - \alpha)(1 - \beta). \tag{4}$$

Note that

$$\alpha = \pi_{1+} = \pi_{11} + \pi_{12}, \qquad 1 - \alpha = \pi_{2+} = \pi_{21} + \pi_{22},$$
$$\beta = \pi_{+1} = \pi_{11} + \pi_{21}, \qquad 1 - \beta = \pi_{+2} = \pi_{12} + \pi_{22},$$

so the condition of independence can be conveniently written as

$$\pi_{ij} = \pi_{i+}\,\pi_{+j}, \qquad i, j = 1, 2. \tag{5}$$

The primary reason that we introduced the symbols $\alpha$ and $\beta$ for $\pi_{1+}$ and $\pi_{+1}$ is to emphasize that under the independence model, there are only two unknown parameters. Once $\alpha$ and $\beta$ are known, the vector $\pi$ can be found using (1)–(4). Under a general multinomial model, the $\pi$ vector contains three unknown parameters. The general multinomial model is often called the saturated model, because it contains the maximum number of unknown parameters. The independence model is a submodel of (i.e. a special case of) the saturated model that satisfies the constraints (5).

Test of independence

The hypothesis of independence can be tested using the general method described in Lecture 4. To test

  $H_0$: the independence model is true

versus

  $H_1$: the saturated model is true,

do the following. First, estimate $\alpha$ and $\beta$, the unknown parameters of the independence model. Second, calculate estimated cell probabilities and expected frequencies from the estimated $\alpha$ and $\beta$. Third, calculate $X^2$ and/or $G^2$ and compare them to the appropriate chisquare distribution.

How can we estimate $\alpha$ and $\beta$? Under $H_0$, $Y$ and $Z$ provide no information about one another, so we can estimate the parameters of their distributions separately.
Note that

$$x_{1+} \sim \mathrm{Bin}(n, \alpha) \tag{6}$$

and

$$x_{+1} \sim \mathrm{Bin}(n, \beta), \tag{7}$$

and under $H_0$ (6) and (7) are independent.

Therefore, the ML estimates of $\alpha$ and $\beta$ are

$$\hat{\alpha} = \frac{x_{1+}}{n} \quad \text{and} \quad \hat{\beta} = \frac{x_{+1}}{n}.$$

Plugging these estimates into (1)–(4) gives estimated probabilities

$$\hat{\pi}_{11} = \frac{x_{1+}}{n}\,\frac{x_{+1}}{n}, \qquad \hat{\pi}_{12} = \frac{x_{1+}}{n}\,\frac{x_{+2}}{n},$$
$$\hat{\pi}_{21} = \frac{x_{2+}}{n}\,\frac{x_{+1}}{n}, \qquad \hat{\pi}_{22} = \frac{x_{2+}}{n}\,\frac{x_{+2}}{n},$$

and estimated expected cell counts

$$E_{11} = n\hat{\pi}_{11} = \frac{x_{1+}\,x_{+1}}{n}, \qquad E_{12} = n\hat{\pi}_{12} = \frac{x_{1+}\,x_{+2}}{n},$$
$$E_{21} = n\hat{\pi}_{21} = \frac{x_{2+}\,x_{+1}}{n}, \qquad E_{22} = n\hat{\pi}_{22} = \frac{x_{2+}\,x_{+2}}{n}.$$

These four formulas are conveniently summarized as

$$E_{ij} = \frac{x_{i+}\,x_{+j}}{n}, \qquad i, j = 1, 2,$$

which can be easily remembered as

$$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$

Under $H_0$, both $X^2$ and $G^2$ are approximately $\chi^2$ provided that the expected counts $E_{ij}$ are sufficiently large. Under $H_0$ the model has 2 unknown parameters, whereas under $H_1$ there are 3 unknowns. The degrees of freedom are therefore $\nu = 3 - 2 = 1$.

A large value of $X^2$ or $G^2$ indicates that the independence model is not plausible, and thus that $Y$ and $Z$ are related. The 95th percentile of $\chi^2_1$ is 3.84, so an observed value of $X^2$ or $G^2$ greater than 3.84 (roughly 4) means that we can reject the null hypothesis of independence at the .05 level.

The test for independence in a 2 × 2 table is a special case of the general goodness-of-fit test discussed in Lecture 4. Therefore, all of the caveats regarding goodness-of-fit tests discussed there apply to this test also. For the chisquare approximation to work well, the $E_{ij}$'s need to be sufficiently large. The iid assumption for the $n$ sample units must be satisfied; there should be no clustering in the data.

Example. Suppose that in a sample of $n = 300$ hospital patients, 90 are overweight, 90 are hypertensive, and 30 are both overweight and hypertensive. Is there evidence of a relationship between these two conditions?
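The summary in this example gives two marginal totals, one cell count, and the sample size; the remaining three cells follow by subtraction. A minimal Python sketch of that step is given here. The function name `fill_two_by_two` and its argument names are assumptions introduced for illustration only.

```python
from typing import List


def fill_two_by_two(n: int, row1_total: int, col1_total: int, both: int) -> List[List[int]]:
    """Recover all four cells of a 2 x 2 table from n, x_{1+}, x_{+1} and x_{11}."""
    x11 = both                      # in row 1 and column 1
    x12 = row1_total - both         # in row 1 but not column 1
    x21 = col1_total - both         # in column 1 but not row 1
    x22 = n - x11 - x12 - x21       # everyone else
    if min(x11, x12, x21, x22) < 0:
        raise ValueError("summary counts are inconsistent")
    return [[x11, x12], [x21, x22]]


# Rows: overweight / not overweight. Columns: hypertensive / not hypertensive.
table = fill_two_by_two(n=300, row1_total=90, col1_total=90, both=30)
print(table)  # [[30, 60], [60, 150]]
```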
The observed data are shown below.

                    hypertensive   not hypertensive   total
  overweight                  30                 60      90
  not overweight              60                150     210
  total                       90                210     300

The expected cell counts for the four cells are

$$E_{11} = \frac{90 \times 90}{300} = 27, \qquad E_{12} = \frac{90 \times 210}{300} = 63,$$
$$E_{21} = \frac{210 \times 90}{300} = 63, \qquad E_{22} = \frac{210 \times 210}{300} = 147.$$

The goodness-of-fit statistics are

$$X^2 = \frac{(30 - 27)^2}{27} + \frac{(60 - 63)^2}{63} + \frac{(60 - 63)^2}{63} + \frac{(150 - 147)^2}{147} = 0.68,$$

$$G^2 = 2\left[\,30 \log\frac{30}{27} + 60 \log\frac{60}{63} + 60 \log\frac{60}{63} + 150 \log\frac{150}{147}\,\right] = 0.67.$$

These do not exceed 4, so we cannot reject the independence model at the .05 level. An approximate p-value is $P(\chi^2_1 \geq .68) = .41$. On the basis of these data, there is little evidence of a relationship between the two conditions.

The test for independence in a 2 × 2 table can be done in Minitab using the chisq command:

  MTB > read c1-c2
  DATA> 30 60
  DATA> 60 150
  DATA> end
        2 rows read.
  MTB > chisq c1-c2

  Expected counts are printed below observed counts

              C1       C2    Total
      1       30       60       90
           27.00    63.00

      2       60      150      210
           63.00   147.00

  Total       90      210      300

  ChiSq =  0.333 +  0.143 +
           0.143 +  0.061 = 0.680
  df = 1

Note that Minitab gives only Pearson's $X^2$. Calculating the deviance $G^2$ in Minitab is a little more tedious. One way to do it is to enter the cell counts in a single column, say, C1. Then enter the row sums and column sums in C2 and C3,
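As a cross-check on the worked example, the expected counts, Pearson's $X^2$, and the deviance $G^2$ can be computed in a few lines of Python. This is a minimal sketch using only the standard library, not the lecture's own method (the lecture uses Minitab). The function names are assumptions introduced here, and the tail probability relies on the identity $P(\chi^2_1 \geq t) = \mathrm{erfc}(\sqrt{t/2})$, which holds only for one degree of freedom.

```python
import math
from typing import List


def expected_counts(x: List[List[float]]) -> List[List[float]]:
    """E_ij = (row total i) * (column total j) / grand total."""
    row = [sum(r) for r in x]
    col = [sum(c) for c in zip(*x)]
    n = sum(row)
    return [[row[i] * col[j] / n for j in range(len(col))] for i in range(len(row))]


def pearson_x2(x: List[List[float]], e: List[List[float]]) -> float:
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - m) ** 2 / m for xr, er in zip(x, e) for o, m in zip(xr, er))


def deviance_g2(x: List[List[float]], e: List[List[float]]) -> float:
    """Deviance: 2 * sum of observed * log(observed / expected); empty cells add 0."""
    return 2.0 * sum(o * math.log(o / m) for xr, er in zip(x, e) for o, m in zip(xr, er) if o > 0)


def chi2_tail_df1(t: float) -> float:
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))


observed = [[30, 60],
            [60, 150]]

e = expected_counts(observed)
x2 = pearson_x2(observed, e)
g2 = deviance_g2(observed, e)

print("expected counts:", e)
print("X2 =", round(x2, 3), " p =", round(chi2_tail_df1(x2), 3))
print("G2 =", round(g2, 3), " p =", round(chi2_tail_df1(g2), 3))
```

Run on the hospital data, this gives $X^2 \approx 0.68$ and $G^2 \approx 0.67$, in agreement with the hand calculation, along with the tail probabilities for comparison with the approximate p-value quoted in the text.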
