PSU STAT 504 - Various Topics from Agresti Chapters 2–3


Stat 504, Lecture 6

2-Way Tables: Various Topics from Agresti, Chapters 2–3

Fisher’s exact test

On rare occasions, we may encounter a $2 \times 2$ table where the row totals $n_{i+}$ and the column totals $n_{+j}$ are both fixed by design. When this happens, we may test for an association between $X$ and $Y$ using Fisher’s exact test.

Example

A lady claims to be able to discern, by taste alone, whether a cup of tea with milk had the tea poured first or the milk poured first. An experiment was performed to see if her claim is valid. Eight cups of tea are prepared and presented to her in random order. Four had the milk poured first, and four had the tea poured first. The lady tasted each one and rendered her opinion. The results are:

                      Lady says
Poured first     tea first   milk first
tea                  3           1
milk                 1           3

The row totals are fixed by the experimenter. The column totals are fixed by the lady, who knows that four of the cups are “tea first” and four are “milk first.”

Under $H_0$: “the lady has no ability,” the four cups she calls “tea first” are a random sample of the eight. If she selects four at random, the probability that three of these four are actually “tea first” comes from the hypergeometric distribution:

$$\frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = \frac{\dfrac{4!}{3!\,1!}\cdot\dfrac{4!}{1!\,3!}}{\dfrac{8!}{4!\,4!}} = \frac{16}{70} = .229.$$

A p-value is the probability of getting a result as extreme or more extreme than the event you actually did observe, if $H_0$ is true. The only result more extreme is that the woman selects all four of the cups that are truly “tea first,” which has probability

$$\frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = \frac{1}{70} = .014.$$

The p-value is $.229 + .014 = .243$, which is only weak evidence against the null.

This test is “exact” because no large-sample approximations are used. The p-value is valid regardless of the sample size.
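The hypergeometric calculation for the tea-tasting table can be checked with a short Python sketch. The function names `hypergeom_pmf` and `fisher_one_sided` and their argument names are hypothetical, chosen here for illustration; only the standard-library function `math.comb` is used.

```python
from math import comb

def hypergeom_pmf(k, row1, row2, col1):
    """P(n11 = k) under H0 when both margins of a 2 x 2 table are fixed.

    row1, row2: the two row totals; col1: the first column total.
    """
    return comb(row1, k) * comb(row2, col1 - k) / comb(row1 + row2, col1)

def fisher_one_sided(n11, row1, row2, col1):
    """Upper-tail p-value: P(n11 >= observed value) under H0."""
    k_max = min(row1, col1)
    return sum(hypergeom_pmf(k, row1, row2, col1)
               for k in range(n11, k_max + 1))

# Tea-tasting table: 4 cups truly tea-first, 4 truly milk-first;
# the lady calls 4 cups "tea first" and 3 of those calls are correct.
p_obs = hypergeom_pmf(3, 4, 4, 4)        # 16/70
p_more = hypergeom_pmf(4, 4, 4, 4)       # 1/70
p_value = fisher_one_sided(3, 4, 4, 4)   # 17/70

print(round(p_obs, 3), round(p_more, 3), round(p_value, 3))
# prints: 0.229 0.014 0.243
```

This is the one-sided p-value used in the example. If SciPy is available, `scipy.stats.fisher_exact([[3, 1], [1, 3]], alternative="greater")` should report the same value.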
Extensions of Fisher’s exact test to more general $I \times J$ tables are more tedious to compute, and have been implemented in programs such as StatXact.

Fisher’s exact test is definitely appropriate when the row totals and column totals are both fixed by design. Some have argued that it may also be used when only one set of margins is truly fixed. This idea arises because the marginal totals $\{n_{1+}, n_{+1}\}$ provide little information about the odds ratio $\theta$.

Exact non-null inference for $\theta$

(Section 3.6) In a $2 \times 2$ table where $n_{1+}$ and $n_{+1}$ are both fixed, the distribution of $n_{11}$ (which determines $n_{12}$, $n_{21}$, $n_{22}$) depends only on the odds ratio $\theta$. When $\theta = 1$, this distribution is hypergeometric, which we used in Fisher’s exact test. More generally, Fisher (1935) gave this distribution for any value of $\theta$. Using this distribution, it is easy to compute Fisher’s exact p-value for testing the null hypothesis $H_0: \theta = \theta^*$ for any $\theta^*$. The set of all values $\theta^*$ that cannot be rejected in a level $\alpha = .05$, two-tailed test forms an exact 95% confidence region for $\theta$.

Bias correction for estimating $\theta$

(Section 3.4.1) In Lecture 4, we learned that the natural estimate of $\theta$ is

$$\hat{\theta} = \frac{n_{11}\, n_{22}}{n_{12}\, n_{21}}$$

and that $\log\hat{\theta}$ is approximately normally distributed about $\log\theta$ with estimated variance

$$\hat{V}(\log\hat{\theta}) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$

Note that this is the estimated variance of the limiting distribution, not an estimate of the variance of $\log\hat{\theta}$ itself. Because there is a nonzero probability that the numerator or the denominator of $\hat{\theta}$ may be zero, the moments of $\hat{\theta}$ and $\log\hat{\theta}$ do not actually exist.

In Section 3.4.1, Agresti suggests a modified estimate which comes from adding 1/2 to each $n_{ij}$,

$$\tilde{\theta} = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)},$$

with estimated variance

$$\hat{V}(\log\tilde{\theta}) = \sum_{i,j} \frac{1}{n_{ij} + 0.5}.$$

One motivation for using $\tilde{\theta}$ rather than $\hat{\theta}$ is that adding 1/2 to each cell count wipes out the second-order term in the Taylor expansion for $\log\theta$, reducing the bias.
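As a rough illustration of the two estimators above, the following Python sketch computes the log odds ratio and its estimated variance with and without the 1/2 correction. The function name `log_odds_ratio`, its `add` parameter, and the example counts are assumptions made for illustration; they do not come from the lecture or from Agresti.

```python
from math import exp, log

def log_odds_ratio(n11, n12, n21, n22, add=0.0):
    """Log odds ratio and its estimated large-sample variance.

    `add` is the constant added to every cell: 0 gives the usual
    estimate, 0.5 gives the modified estimate.
    """
    a, b, c, d = (n + add for n in (n11, n12, n21, n22))
    estimate = log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return estimate, variance

# Hypothetical counts (the tea-tasting table, reused for illustration).
plain, var_plain = log_odds_ratio(3, 1, 1, 3)
adjusted, var_adjusted = log_odds_ratio(3, 1, 1, 3, add=0.5)
print(exp(plain), exp(adjusted))

# With a zero cell the unmodified estimate is undefined,
# but the modified one stays finite.
try:
    log_odds_ratio(4, 0, 0, 4)
    zero_cell_ok = True
except ZeroDivisionError:
    zero_cell_ok = False
sparse, var_sparse = log_odds_ratio(4, 0, 0, 4, add=0.5)
print(zero_cell_ok, exp(sparse))
```

For the 3, 1, 1, 3 table the unmodified odds ratio is 9 and the modified one is 12.25/2.25, so the correction pulls the estimate toward 1, in line with the smoothing interpretation of adding a constant to each cell.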
In smaller samples, $\log\tilde{\theta}$ may be slightly less biased than $\log\hat{\theta}$.

The use of $\tilde{\theta}$ can also be motivated on Bayesian grounds. Adding constants such as 1/2 to the cell frequencies can be interpreted as Bayesian inference under a particular kind of prior distribution (Dirichlet). For discussion of Bayesian inference under a Dirichlet prior, see Section 13.4.

The practical effect of adding a constant such as 1/2 is to smooth the estimated cell probabilities toward a uniform table, where all elements of $\pi$ are equal. In a large, sparse table, adding 1/2 to each cell frequency could lead to oversmoothing, because the total number of hypothetical prior observations being added (1/2 times the number of cells) could be nearly as large or even larger than the actual sample size $n$.

Test for independence in an $I \times J$ table

With two categorical variables,

  $X$, taking possible values $i = 1, 2, \ldots, I$,
  $Y$, taking possible values $j = 1, 2, \ldots, J$,

the counts are usually arranged in a two-way table:

          Y = 1     Y = 2     ...    Y = J
X = 1     n_{11}    n_{12}    ...    n_{1J}
X = 2     n_{21}    n_{22}    ...    n_{2J}
...       ...       ...       ...    ...
X = I     n_{I1}    n_{I2}    ...    n_{IJ}

The total number of cells is $N = IJ$, and the marginal totals are:

$$n_{i+} = \sum_{j=1}^{J} n_{ij}, \quad i = 1, \ldots, I \quad \text{(rows)}$$

$$n_{+j} = \sum_{i=1}^{I} n_{ij}, \quad j = 1, \ldots, J \quad \text{(columns)}$$

$$n_{++} = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} = n \quad \text{(grand total)}$$

Let us suppose that the cell counts $(n_{11}, \ldots, n_{IJ})$ have a multinomial distribution with index $n_{++} = n$ and parameters

$$\pi = (\pi_{11}, \ldots, \pi_{IJ})$$

(these results also work for Poisson or product-multinomial sampling). Because the elements of $\pi$ must sum to one, the saturated model has $IJ - 1$ free parameters.

If the two variables $X$ and $Y$ are independent, then $\pi$ has a special form. Let

$$\alpha_i = P(X = i), \quad i = 1, 2, \ldots, I,$$

$$\beta_j = P(Y = j), \quad j = 1, 2, \ldots, J.$$

Note that $\sum_{i=1}^{I} \alpha_i = \sum_{j=1}^{J} \beta_j = 1$, so the vectors $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_I)$ and $\beta = (\beta_1, \beta_2, \ldots, \beta_J)$ contain $I - 1$ and $J - 1$ unknown parameters, respectively. If $X$ and $Y$ are independent, then an element of $\pi$ can be expressed as

$$\pi_{ij} = P(X = i)\, P(Y = j) = \alpha_i \beta_j. \qquad (1)$$

Thus, under independence, $\pi$ is a function of $(I - 1) + (J - 1)$ unknown parameters.

The parameters under the independence model can be estimated as follows. Note that the vector of row sums,

$$(n_{1+}, n_{2+}, \ldots, n_{I+}),$$

has a multinomial distribution with index $n$ and parameter $\alpha$. The vector of column sums,

$$(n_{+1}, n_{+2}, \ldots, n_{+J}),$$

has a multinomial distribution with index $n = n_{++}$ and parameter $\beta$. The elements of $\alpha$ and $\beta$ can thus be estimated by

$$\hat{\alpha}_i = n_{i+} / n_{++}, \quad i = 1, 2, \ldots, I,$$

and

$$\hat{\beta}_j = n_{+j} / n_{++}, \quad j = 1, 2, \ldots, J,$$

respectively. Using (1), the estimated expected cell frequencies under the independence model
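The marginal estimates of α and β can be sketched in Python as follows. The source text breaks off before the expected frequencies are written out, so the formula used in the code, n++ times the estimated α_i times the estimated β_j (equivalently n_{i+} n_{+j} / n++), is an assumption: it is the standard result of multiplying (1) by the sample size. The function name `independence_fit` and the example table are hypothetical.

```python
def independence_fit(table):
    """Estimate alpha, beta, and expected counts under independence.

    `table` is a list of rows of observed counts n_ij.
    """
    n = sum(sum(row) for row in table)            # grand total n++
    row_totals = [sum(row) for row in table]      # n_i+
    col_totals = [sum(col) for col in zip(*table)]  # n_+j
    alpha = [r / n for r in row_totals]           # estimated P(X = i)
    beta = [c / n for c in col_totals]            # estimated P(Y = j)
    # Assumed completion of the truncated sentence: n * alpha_i * beta_j.
    expected = [[n * a * b for b in beta] for a in alpha]
    return alpha, beta, expected

# Made-up 2 x 3 table, for illustration only.
observed = [[10, 20, 30],
            [20, 10, 10]]
alpha_hat, beta_hat, expected = independence_fit(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

For this made-up table the row totals are 60 and 40 and the column totals are 30, 30, and 40, so the fitted counts are approximately 18, 18, 24 in the first row and 12, 12, 16 in the second. The fitted table reproduces the observed margins, which is a quick sanity check on any implementation.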

