PSU STAT 504 - Various Topics from Agresti Chapters 2–3 - D1616942


Course Stat 504- Analysis of Discrete Data

Pages 27


Stat 504, Lecture 6

2-Way Tables: Various Topics from Agresti, Chapters 2–3

Fisher's exact test

On rare occasions, we may encounter a 2 × 2 table where the row totals $n_{i+}$ and the column totals $n_{+j}$ are both fixed by design. When this happens, we may test for an association between X and Y using Fisher's exact test.

Example

A lady claims to be able to discern, by taste alone, whether a cup of tea with milk had the tea poured first or the milk poured first. An experiment was performed to see if her claim is valid. Eight cups of tea are prepared and presented to her in random order. Four had the milk poured first, and four had the tea poured first. The lady tasted each one and rendered her opinion. The results are:

                      Lady says
    Poured first   tea first   milk first
    tea                3           1
    milk               1           3

The row totals are fixed by the experimenter. The column totals are fixed by the lady, who knows that four of the cups are "tea first" and four are "milk first."

Under $H_0$: "the lady has no ability," the four cups she calls "tea first" are a random sample of the eight. If she selects four at random, the probability that three of these four are actually "tea first" comes from the hypergeometric distribution:

$$\frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = \frac{\dfrac{4!}{3!\,1!}\cdot\dfrac{4!}{1!\,3!}}{\dfrac{8!}{4!\,4!}} = \frac{16}{70} = .229.$$

A p-value is the probability of getting a result as extreme or more extreme than the event you actually did observe, if $H_0$ is true. The only result more extreme is that the woman selects all four of the cups that are truly "tea first," which has probability

$$\frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = \frac{1}{70} = .014.$$

The p-value is .229 + .014 = .243, which is only weak evidence against the null.

This test is "exact" because no large-sample approximations are used. The p-value is valid regardless of the sample size.
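The tea-tasting calculation can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names `hypergeom_pmf` and `fisher_upper_tail` are illustrative choices, not part of the lecture, and the p-value computed is the one-sided (upper-tail) version used in the example.

```python
from math import comb

def hypergeom_pmf(k, n_success, n_total, n_draw):
    """P(exactly k successes) when n_draw items are drawn without
    replacement from n_total items, n_success of which are successes."""
    return (comb(n_success, k) * comb(n_total - n_success, n_draw - k)
            / comb(n_total, n_draw))

def fisher_upper_tail(n11, row1_total, col1_total, n):
    """One-sided p-value P(N11 >= n11) under H0 with both margins fixed."""
    k_max = min(row1_total, col1_total)
    return sum(hypergeom_pmf(k, row1_total, n, col1_total)
               for k in range(n11, k_max + 1))

# Tea-tasting table: 4 cups are truly "tea first" (row total),
# the lady calls 4 cups "tea first" (column total), n = 8 cups in all.
p_three = hypergeom_pmf(3, 4, 8, 4)       # 16/70
p_four = hypergeom_pmf(4, 4, 8, 4)        # 1/70
p_value = fisher_upper_tail(3, 4, 4, 8)   # 17/70

print(round(p_three, 3), round(p_four, 3), round(p_value, 3))
# prints: 0.229 0.014 0.243
```

The sum runs over every value of $n_{11}$ at least as large as the observed one, which matches the "as extreme or more extreme" definition of the p-value given above.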
Extensions of Fisher's exact test to more general I × J tables are more tedious to compute, and have been implemented in programs such as StatXact.

Fisher's exact test is definitely appropriate when the row totals and column totals are both fixed by design. Some have argued that it may also be used when only one set of margins is truly fixed. This idea arises because the marginal totals $\{n_{1+}, n_{+1}\}$ provide little information about the odds ratio $\theta$.

Exact non-null inference for $\theta$

(Section 3.6) In a 2 × 2 table where $n_{1+}$ and $n_{+1}$ are both fixed, the distribution of $n_{11}$ (which determines $n_{12}, n_{21}, n_{22}$) depends only on the odds ratio $\theta$. When $\theta = 1$, this distribution is hypergeometric, which we used in Fisher's exact test. More generally, Fisher (1935) gave this distribution for any value of $\theta$. Using this distribution, it is easy to compute Fisher's exact p-value for testing the null hypothesis $H_0: \theta = \theta^*$ for any $\theta^*$. The set of all values $\theta^*$ that cannot be rejected in a level $\alpha = .05$, two-tailed test forms an exact 95% confidence region for $\theta$.

Bias correction for estimating $\theta$

(Section 3.4.1) In Lecture 4, we learned that the natural estimate of $\theta$ is

$$\hat\theta = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}$$

and that $\log\hat\theta$ is approximately normally distributed about $\log\theta$ with estimated variance

$$\hat V(\log\hat\theta) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$

Note that this is the estimated variance of the limiting distribution, not an estimate of the variance of $\log\hat\theta$ itself. Because there is a nonzero probability that the numerator or the denominator of $\hat\theta$ may be zero, the moments of $\hat\theta$ and $\log\hat\theta$ do not actually exist.

In Section 3.4.1, Agresti suggests a modified estimate which comes from adding 1/2 to each $n_{ij}$,

$$\tilde\theta = \frac{(n_{11}+0.5)(n_{22}+0.5)}{(n_{12}+0.5)(n_{21}+0.5)},$$

with estimated variance

$$\hat V(\log\tilde\theta) = \sum_{i,j} \frac{1}{n_{ij}+0.5}.$$

One motivation for using $\tilde\theta$ rather than $\hat\theta$ is that adding 1/2 to each cell count wipes out the second-order term in the Taylor expansion for $\log\theta$, reducing the bias.
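To make the two estimates concrete, here is a small sketch that computes $\hat\theta$, $\tilde\theta$, and the two estimated variances for a 2 × 2 table. The function name `odds_ratio_estimates` and the choice to return `None` for the unmodified quantities when any cell is zero are assumptions made for illustration; the tea-tasting counts are reused only as convenient input.

```python
def odds_ratio_estimates(n11, n12, n21, n22):
    """Return (theta_hat, var_log_hat, theta_tilde, var_log_tilde).

    theta_hat and its log-variance are reported as None when any cell
    count is zero, since log(theta_hat) or 1/n_ij is then undefined
    (an illustrative convention, not taken from the lecture).
    """
    cells = (n11, n12, n21, n22)
    shifted = [c + 0.5 for c in cells]   # add 1/2 to every cell

    theta_tilde = (shifted[0] * shifted[3]) / (shifted[1] * shifted[2])
    var_log_tilde = sum(1.0 / c for c in shifted)

    if min(cells) > 0:
        theta_hat = (n11 * n22) / (n12 * n21)
        var_log_hat = sum(1.0 / c for c in cells)
    else:
        theta_hat = var_log_hat = None

    return theta_hat, var_log_hat, theta_tilde, var_log_tilde

# Tea-tasting table: n11 = 3, n12 = 1, n21 = 1, n22 = 3.
theta_hat, var_hat, theta_tilde, var_tilde = odds_ratio_estimates(3, 1, 1, 3)
print(theta_hat, theta_tilde)   # theta_hat is 9.0; theta_tilde is 49/9

# With zero cells the modified estimate is still finite.
zero_case = odds_ratio_estimates(4, 0, 0, 4)
print(zero_case[0], zero_case[2])   # None 81.0
```

The second call shows the practical appeal of the correction: with empty cells, $\hat\theta$ cannot be used on the log scale, while $\tilde\theta$ and its variance estimate remain well defined.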
In smaller samples, $\log\tilde\theta$ may be slightly less biased than $\log\hat\theta$.

The use of $\tilde\theta$ can also be motivated on Bayesian grounds. Adding constants such as 1/2 to the cell frequencies can be interpreted as Bayesian inference under a particular kind of prior distribution (Dirichlet). For discussion of Bayesian inference under a Dirichlet prior, see Section 13.4.

The practical effect of adding a constant such as 1/2 is to smooth the estimated cell probabilities toward a uniform table, where all elements of $\pi$ are equal. In a large, sparse table, adding 1/2 to each cell frequency could lead to oversmoothing, because the total number of hypothetical prior observations being added (1/2 times the number of cells) could be nearly as large or even larger than the actual sample size $n$.

Test for independence in an I × J table

With two categorical variables,

    X, taking possible values i = 1, 2, ..., I,
    Y, taking possible values j = 1, 2, ..., J,

the counts are usually arranged in a two-way table:

$$\begin{array}{c|cccc}
 & Y=1 & Y=2 & \cdots & Y=J \\ \hline
X=1 & n_{11} & n_{12} & \cdots & n_{1J} \\
X=2 & n_{21} & n_{22} & \cdots & n_{2J} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
X=I & n_{I1} & n_{I2} & \cdots & n_{IJ}
\end{array}$$

The total number of cells is $N = IJ$, and the marginal totals are:

$$n_{i+} = \sum_{j=1}^{J} n_{ij}, \quad i = 1,\ldots,I \quad \text{(rows)}$$

$$n_{+j} = \sum_{i=1}^{I} n_{ij}, \quad j = 1,\ldots,J \quad \text{(columns)}$$

$$n_{++} = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij} = n \quad \text{(grand total)}$$

Let us suppose that the cell counts $(n_{11},\ldots,n_{IJ})$ have a multinomial distribution with index $n_{++} = n$ and parameters

$$\pi = (\pi_{11},\ldots,\pi_{IJ})$$

(these results also work for Poisson or product-multinomial sampling). Because the elements of $\pi$ must sum to one, the saturated model has $IJ - 1$ free parameters.

If the two variables X and Y are independent, then $\pi$ has a special form. Let

$$\alpha_i = P(X = i), \quad i = 1, 2, \ldots, I,$$

$$\beta_j = P(Y = j), \quad j = 1, 2, \ldots, J.$$

Note that $\sum_{i=1}^{I}\alpha_i = \sum_{j=1}^{J}\beta_j = 1$, so the vectors $\alpha = (\alpha_1,\alpha_2,\ldots,\alpha_I)$ and $\beta = (\beta_1,\beta_2,\ldots,\beta_J)$ contain $I - 1$ and $J - 1$ unknown parameters, respectively. If X and Y are independent, then an element of $\pi$ can be expressed as

$$\pi_{ij} = P(X = i)\,P(Y = j) = \alpha_i\,\beta_j. \tag{1}$$

Thus, under independence, $\pi$ is a function of $(I - 1) + (J - 1)$ unknown parameters.

The parameters under the independence model can be estimated as follows. Note that the vector of row sums,

$$(n_{1+}, n_{2+}, \ldots, n_{I+}),$$

has a multinomial distribution with index $n$ and parameter $\alpha$. The vector of column sums,

$$(n_{+1}, n_{+2}, \ldots, n_{+J}),$$

has a multinomial distribution with index $n = n_{++}$ and parameter $\beta$. The elements of $\alpha$ and $\beta$ can thus be estimated by

$$\hat\alpha_i = n_{i+}/n_{++}, \quad i = 1, 2, \ldots, I,$$

and

$$\hat\beta_j = n_{+j}/n_{++}, \quad j = 1, 2, \ldots, J,$$

respectively. Using (1), the estimated expected cell frequencies under the independence model
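The marginal estimates $\hat\alpha_i$ and $\hat\beta_j$ and their product from (1) can be sketched as follows. The 2 × 3 table of counts is hypothetical data invented for this illustration, and the function name `independence_fit` is likewise an assumption; the sketch stops at the fitted cell probabilities $\hat\alpha_i\hat\beta_j$ obtained by plugging the estimates into (1).

```python
def independence_fit(table):
    """Estimate alpha, beta and the cell probabilities alpha_i * beta_j
    from an I x J table of counts given as a list of rows."""
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)                  # grand total n++

    alpha_hat = [sum(row) / n for row in table]         # n_{i+} / n_{++}
    beta_hat = [sum(table[i][j] for i in range(n_rows)) / n
                for j in range(n_cols)]                 # n_{+j} / n_{++}

    # Plug the estimates into equation (1): pi_ij = alpha_i * beta_j.
    pi_hat = [[a * b for b in beta_hat] for a in alpha_hat]
    return alpha_hat, beta_hat, pi_hat

# Hypothetical counts, used only to exercise the formulas.
counts = [[10, 20, 30],
          [20, 10, 10]]
alpha_hat, beta_hat, pi_hat = independence_fit(counts)

n_rows, n_cols = len(counts), len(counts[0])
saturated_params = n_rows * n_cols - 1                  # IJ - 1
independence_params = (n_rows - 1) + (n_cols - 1)       # (I-1) + (J-1)

print(alpha_hat, beta_hat)
print(saturated_params, independence_params)            # 5 3
```

For this table the saturated model has 5 free parameters while the independence model has 3, which illustrates the parameter counts stated above.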
