UVA STAT 2120 - Topic 10 Notes


STAT 2120: Notes on Topic 10

Analysis of two-way tables:
• Some relevant summaries of categorical data in two categories:
  o Count or percent of the number of “successes” (as in the two-sample setup of inference for proportions).
  o A two-way table, which easily generalizes to multiple categories (i.e., more than just “successes” and “failures”) and multiple samples (i.e., more than just two).
• Recall the basic setup of two-way tables:
  o A two-way table involves a row variable and a column variable.
  o Notation uses $r$ to denote the number of categories of the row variable and $c$ to denote the number of categories of the column variable. The table is identified as an $r \times c$ table.
  o A cell is identified with each distinct combination of values of the two variables.
  o The table may record the counts or percentages of data falling in each cell. The distribution across the individual cells is called a joint distribution.
  o The distributions of the row and column variables appear in the margins of the table, and are called marginal distributions. Given as counts, they are called row and column totals.
  o Notation uses $n$ to denote the total number of data points that have been collected. (It is the sum of either the row totals or the column totals. In the two-sample setup it is $n = n_1 + n_2$.)
  o A conditional distribution is calculated from the counts of one variable limited to a given category of the other variable. These help to explore relationships between the variables.
• The sampling framework for two-way tables may arise in various ways:
  o Multiple, independent SRSs of categorical data from distinct populations, in which sample labels form one variable and the categorical data values form the other.
  o A single SRS of paired categorical data, each point of which falls in a cell of the table.
• The question of interest, in “tabular thinking,” is whether there is a relationship between the row and column variables.
  o In the two-sample setup, this is equivalent to contemplating $H_0: p_1 = p_2$ versus $H_a: p_1 \neq p_2$.
  o Another terminology asks whether there is an association between the variables. Here, the comparison is $H_0$: no association versus $H_a$: association. The direction of the association under $H_a$ is unspecified.
  o The question may be asked more formally as whether the conditional distributions of one variable vary across the categories (given as the conditioning event) of the other variable. That is, one contemplates $H_0$: no variation among conditional distributions versus $H_a$: variation among conditional distributions.
• The basic approach to inference is to compare the observed cell counts to the expected cell counts, as they would be calculated under the null hypothesis of no association.
  o An expected cell count is calculated as the product of the row and column totals corresponding to that combination of categories, divided by $n$. That is, $\text{exp. count} = (\text{row total} \times \text{column total}) / n$.
• The chi-square statistic is a standardized metric that summarizes the distance between observed and expected cell counts, aggregated across all cells.
  o The formula for the chi-square statistic is $X^2 = \sum \frac{(\text{obs. count} - \text{exp. count})^2}{\text{exp. count}}$, where the sum is over all cells of the table.
  o A large value of $X^2$ provides evidence of a relationship between the variables. Thus, a test of $H_0$: no association versus $H_a$: association would reject $H_0$ if $X^2$ is large.
  o A P-value would be calculated from the sampling distribution of $X^2$ under $H_0$.
• The family of chi-square ($\chi^2$) distributions describes the approximate sampling distribution of $X^2$ under $H_0$.
  o A specific chi-square distribution is denoted $\chi^2_k$, or $\chi^2_{\text{df}}$, where $k$, or df, is a degree-of-freedom parameter that indexes the family.
  o Every chi-square distribution is right-skewed and takes only positive values.
  o Suppose the random variable $W$ is $\chi^2_k$, $w$ is a positive number, and $p$ is a number between 0 and 1.
    In Excel, chidist(w, k) $= P(W \geq w)$ and chiinv(p, k) is the $w$ for which $P(W \geq w) = p$.
  o When $H_0$ is true (no association), the sampling distribution of the chi-square statistic is approximately $\chi^2_k$ with $k = (r - 1)(c - 1)$. Larger (expected) cell counts improve the accuracy of this approximation, especially for tables larger than 2 × 2.
• The chi-square test for two-way tables is as follows:
  o Assumptions: A valid sampling framework for two-way tables.
  o The comparison of hypotheses is $H_0$: no association versus $H_a$: association.
  o The standardized test statistic is $X^2 = \sum \frac{(\text{obs. count} - \text{exp. count})^2}{\text{exp. count}}$.
  o The P-value is calculated as $P(W \geq X^2)$ for $W$ having a $\chi^2_k$ distribution with $k = (r - 1)(c - 1)$.
  o A rule of thumb is that the stated significance level for this test is accurate if: the average expected count satisfies $\frac{n}{rc} = \frac{1}{rc} \sum \text{exp. count} \geq 5$ and every exp. count $\geq 1$, when $r > 2$ or $c > 2$; every exp. count $\geq 5$, when $r = 2$ and $c = 2$.
• Data in a two-way table may arise from an observational study or a designed experiment. The chi-square test can establish causation only if the data came from a (designed) comparative, randomized experiment.
• In the 2 × 2 case, the P-value of the chi-square test is identical to that of the two-sample z test for proportions, in the formulation comparing $H_0: p_1 = p_2$ versus $H_a: p_1 \neq p_2$.
• Two-way tables may be useful as a tool for meta-analysis, in which the information from several studies is combined and analyzed together.

Additional comments on the analysis of two-way tables:
• Steps for a generic analysis of data in a two-way table:
  o Explore the data using relevant descriptive statistics such as histograms of the marginal and conditional distributions (usually after converting to relevant percentages).
  o Calculate expected cell counts and with these calculate the chi-square statistic.
  o Complete the test for an association by calculating a P-value (and checking the relevant rule of thumb for validity).
  o Draw conclusions about association based on the outcome of the test.
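The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not part of the course materials: the function names `chi2_sf` and `chi_square_test` and the 2 × 2 table of counts are hypothetical. The upper-tail chi-square probability (the quantity Excel's chidist returns) is computed here from a standard recurrence for the incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(W >= x) for W chi-square with df degrees of freedom.

    Starts from the closed forms for df = 1 (erfc) or df = 2 (exp) and applies
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))  # base case: df = 1
    else:
        a, q = 1.0, math.exp(-z)             # base case: df = 2
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_test(observed):
    """Chi-square test of no association for a table given as a list of rows of counts."""
    r, c = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(row[j] for row in observed) for j in range(c)]
    n = sum(row_totals)

    # Exploration: conditional distribution of the column variable within each row.
    conditional = [[cell / row_totals[i] for cell in row]
                   for i, row in enumerate(observed)]

    # Expected count = row total * column total / n.
    expected = [[row_totals[i] * col_totals[j] / n for j in range(c)]
                for i in range(r)]

    # X^2 = sum over all cells of (obs - exp)^2 / exp.
    x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(r) for j in range(c))
    df = (r - 1) * (c - 1)

    # Rule of thumb for the accuracy of the chi-square approximation.
    cells = [e for row in expected for e in row]
    if r == 2 and c == 2:
        rule_ok = min(cells) >= 5
    else:
        rule_ok = sum(cells) / len(cells) >= 5 and min(cells) >= 1

    return {"conditional": conditional, "expected": expected, "x2": x2,
            "df": df, "p_value": chi2_sf(x2, df), "rule_of_thumb_ok": rule_ok}

# Hypothetical counts: rows are two samples, columns are "success" / "failure".
table = [[30, 20], [20, 30]]
result = chi_square_test(table)
print(result["x2"], result["df"], round(result["p_value"], 4))  # 4.0 1 0.0455
```

For this made-up table the sample proportions are 0.6 and 0.4, the two-sample z statistic is 2, and $z^2 = 4$ equals $X^2$, which illustrates why the two tests give the same P-value in the 2 × 2 case.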
• Comments on exploratory analysis of conditional distributions:
  o As discussed before, examination of conditional distributions may help to explore relationships between categorical variables.
  o To explore a one-way relationship, one would typically examine

