PSU STAT 504 - Stat 504 Lecture 4

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Stat 504, Lecture 4

One-Way Tables and Goodness of Fit

A frequency table arises when sample units are classified into mutually exclusive categories; the number of units falling into each category is recorded. “One way” means that units are classified according to a single categorical variable. The frequencies or counts can be arranged in a single row or column.

Example. A sample of n = 96 persons is obtained, and the eye color of each person is recorded.

Eye color    Count
Brown           46
Blue            22
Green           26
Other            2
Total           96

Brown, blue, green, and other have no intrinsic ordering. The response variable, eye color, is therefore an unordered categorical or nominal variable.

Example. Hypothetical attitudes of n = 116 people toward war against Iraq.

Attitude             Count
Strongly disagree       35
Disagree                27
Agree                   23
Strongly agree          31
Total                  116

The response categories in this example are clearly ordered, but no objectively defined numerical scores can be attached to the categories. The response variable, attitude, is therefore said to be an ordered categorical or ordinal variable.

Example. Same as above, but with additional categories for “not sure” and “refused to answer.”

Attitude             Count
Strongly disagree       35
Disagree                27
Agree                   23
Strongly agree          31
Not sure                 6
Refusal                  8
Total                  130

The first four categories are clearly ordered, but the placement of “not sure” and “refusal” in the ordering is questionable. We would have to say that this response is partially ordered.

Example. Number of children in n = 105 randomly selected families.

Number of children   Count
0                       24
1                       26
2                       29
3                       13
4–5                     11
6+                       2
Total                  105

The original data, the raw number of children, have been coarsened into six categories (0, 1, 2, 3, 4–5, 6+). These categories are clearly ordered but, unlike the previous example, the categories have objectively defined numeric values attached to them. We can say that this table represents coarsened numeric data.

Example.
Total gross income of n = 100 households.

Income               Count
below $10,000           11
$10,000–$24,999         23
$25,000–$39,999         30
$40,000–$59,999         24
$60,000 and above       12
Total                  100

The original data (raw incomes) were essentially continuous. Any type of data, continuous or discrete, can be grouped or coarsened into categories.

Grouping data will typically result in some loss of information. How much information is lost depends on (a) the number of categories and (b) the question being addressed. In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below $10,000 has not been affected, but estimating the proportion of households with incomes above $75,000 is now virtually impossible.

Assumptions

A one-way frequency table with k cells will be denoted by the vector

$$X = (X_1, X_2, \ldots, X_k),$$

where $X_j$ is the count or frequency in cell j. Note that X is a summary of the original raw dataset consisting of $n = \sum_{j=1}^{k} X_j$ observations. In the fourth example (number of children), for instance, the raw data consisted of n = 105 integers; 24 of them were 0, 26 of them were 1, etc.

We will typically assume that X has a multinomial distribution,

$$X \sim \mathrm{Mult}(n, \pi),$$

where n is fixed and known, and

$$\pi = (\pi_1, \pi_2, \ldots, \pi_k)$$

may or may not have to be estimated. If n is random, we can still apply the multinomial model (Lecture 3). But we do have to worry about other kinds of model failure.

The critical assumptions of the multinomial are (a) the n trials are independent, and (b) the parameter π remains constant from trial to trial. The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes that are more similar than trials from different clusters. Clustering can be thought of as a violation of either (a) or (b).

Example.
Recall the first example, in which eye color was recorded for n = 96 persons. Suppose that the sample did not consist of unrelated individuals, but that some whole families were present. Persons within a family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated.

Now suppose that the sample consisted of “unrelated” persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that’s okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.

Could this be considered a multinomial situation? For all practical purposes, yes. The sampled individuals are not independent according to the common English definition of the word, because they all live in Pennsylvania. But we can suppose that they are independent from a statistical viewpoint, because the individuals are exchangeable; before the sample is drawn, no two are a priori any more likely to have the same eye color than any other two.

Pearson and deviance statistics

Now we introduce two statistics that measure goodness of fit; that is, they measure how well an observed table X corresponds to the model Mult(n, π) for some vector π.

The Pearson goodness-of-fit statistic is

$$X^2 = \sum_{j=1}^{k} \frac{(X_j - n\pi_j)^2}{n\pi_j}.$$

An easy way to remember it is

$$X^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j},$$

where $O_j = X_j$ is the observed count and $E_j = E(X_j) = n\pi_j$ is the expected count in cell j.

The deviance statistic is

$$G^2 = 2 \sum_{j=1}^{k} X_j \log \frac{X_j}{n\pi_j},$$

where “log” means natural logarithm. An easy way to remember it is

$$G^2 = 2 \sum_{j} O_j \log \frac{O_j}{E_j}.$$

In some texts, $G^2$ is also called the likelihood-ratio test statistic.
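
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names pearson_x2 and deviance_g2 are hypothetical, the uniform vector π = (0.25, 0.25, 0.25, 0.25) is an assumed null model chosen only to have something to test, and cells with a zero count are taken to contribute nothing to $G^2$ (the usual 0 log 0 = 0 convention, also an assumption here). The counts are the eye-color data from the first example.

```python
import math

def pearson_x2(counts, probs):
    """Pearson statistic: sum over cells of (O_j - E_j)^2 / E_j, with E_j = n * pi_j."""
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

def deviance_g2(counts, probs):
    """Deviance statistic: 2 * sum over cells of O_j * log(O_j / E_j), natural log.

    Cells with O_j = 0 are skipped (assumed convention: 0 * log 0 = 0).
    """
    n = sum(counts)
    return 2.0 * sum(
        x * math.log(x / (n * p)) for x, p in zip(counts, probs) if x > 0
    )

# Eye-color counts from the first example: Brown, Blue, Green, Other (n = 96).
eye_counts = [46, 22, 26, 2]

# Hypothetical null model for illustration only: all four colors equally likely.
uniform_pi = [0.25, 0.25, 0.25, 0.25]

x2 = pearson_x2(eye_counts, uniform_pi)
g2 = deviance_g2(eye_counts, uniform_pi)
print(f"X^2 = {x2:.3f}, G^2 = {g2:.3f}")
```

Both functions take the observed counts and the model probabilities, mirroring the notation $X^2(x, \pi)$ and $G^2(x, \pi)$; the expected counts $E_j = n\pi_j$ are recomputed inside each function so that the factor of 2 in $G^2$ is the only constant to remember.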
A common mistake in calculating $G^2$ is to leave out the factor of 2 at the front; don’t forget the 2.

Note that $X^2$ and $G^2$ are both functions of the observed data X and a vector of probabilities π. For this reason, we will sometimes write them as $X^2(x, \pi)$ and $G^2(x, \pi)$, respectively; when there is no ambiguity, however, we will simply use $X^2$ and $G^2$.

Testing goodness of fit

$X^2$ and $G^2$ both measure how closely the model Mult(n, π) “fits” the observed data.

• If the sample proportions $p_j = X_j/n$ are exactly equal to the model’s $\pi_j$ for j = 1, 2, ..., k, then $O_j = E_j$ for all j, and both $X^2$ and $G^2$ are zero.
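
The perfect-fit case can be checked directly from the two definitions. The following short derivation is a sketch that uses only the symbols already defined ($X_j$, $p_j$, $\pi_j$, $O_j$, $E_j$) and the fact that log 1 = 0:

```latex
\begin{align*}
p_j = \pi_j \ \text{for all } j
  &\;\Longrightarrow\; X_j = n p_j = n\pi_j,
   \ \text{i.e. } O_j = E_j \ \text{for all } j, \\
X^2 &= \sum_{j=1}^{k} \frac{(X_j - n\pi_j)^2}{n\pi_j}
     = \sum_{j=1}^{k} \frac{0^2}{n\pi_j} = 0, \\
G^2 &= 2 \sum_{j=1}^{k} X_j \log \frac{X_j}{n\pi_j}
     = 2 \sum_{j=1}^{k} X_j \log 1 = 0.
\end{align*}
```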

