UF STA 6126 - Association between categorical variables

Slide titles:
  8. Association between Categorical Variables
  Guidelines for Contingency Tables
  Independence & Dependence
  Chi-Squared Test of Independence (Karl Pearson, 1900)
  Chi-Squared Test Statistic
  Properties of chi-squared distribution
  Example: Happiness and family income
  Software output (SPSS)
  Comments about chi-squared test
  Example (from Chap. 7): College Alcohol Study conducted by Harvard School of Public Health
  Residuals: Detecting Patterns of Association
  SPSS Output
  A couple more happiness analyses
  Measures of Association
  Example: Opinion about George W. Bush performance as President (9/08 Gallup poll)
  Comparisons using ratios
  The “odds”
  The odds ratio
  Properties of the odds ratio
  Limitations of the chi-squared test
  Example: Effect of n on statistical significance (for a given degree of association)
  Example (small P-value does not imply strong association)

8. Association between Categorical Variables

• Suppose both response and explanatory variables are categorical. (Chap. 9 considers both quantitative.)
• There is association if the population conditional distribution for the response variable differs among the categories of the explanatory variable.

Example: Contingency table on happiness cross-classified by family income (data from 2006 GSS)

                      Happiness
Income     Very         Pretty       Not too      Total
-------------------------------------------------------
Above      272 (44%)    294 (48%)     49 (8%)       615
Average    454 (32%)    835 (59%)    131 (9%)      1420
Below      185 (20%)    527 (57%)    208 (23%)      920
-------------------------------------------------------

Response: Happiness, Explanatory: Income

The sample conditional distributions on happiness vary by income level, but can we conclude that this is also true in the population?

Guidelines for Contingency Tables

• Show sample conditional distributions: percentages for the response variable within the categories of the explanatory variable.
  Find them by dividing the cell counts by the explanatory category total and multiplying by 100. (Percentages on the response categories will add to 100.)
• Clearly define variables and categories.
• If you display percentages but not the cell counts, include the explanatory total sample sizes, so the reader can (if desired) recover all the cell count data.

(I use rows for the explanatory variable, columns for the response variable.)

Independence & Dependence

• Statistical independence (no association): Population conditional distributions on one variable are the same for all categories of the other variable.
• Statistical dependence (association): Conditional distributions are not all identical.

Example of statistical independence:

                 Happiness
Income     Very    Pretty    Not too
------------------------------------
Above      32%     55%       13%
Average    32%     55%       13%
Below      32%     55%       13%

Chi-Squared Test of Independence (Karl Pearson, 1900)

• Tests H0: The variables are statistically independent
• Ha: The variables are statistically dependent
• Intuition behind test statistic: Summarize differences between observed cell counts and expected cell counts (what is expected if H0 is true)
• Notation:
  fo = observed frequency (cell count)
  fe = expected frequency
  r = number of rows in table, c = number of columns

Expected frequencies (fe):
– Have identical conditional distributions. Those distributions are the same as the column (response) marginal distribution of the data.
– Have the same marginal distributions (row and column totals) as the observed frequencies.
– Computed by fe = (row total)(column total)/n

                        Happiness
Income     Very           Pretty         Not too        Total
-------------------------------------------------------------
Above      272 (189.6)    294 (344.6)     49 (80.8)       615
Average    454 (437.8)    835 (795.8)    131 (186.5)     1420
Below      185 (283.6)    527 (515.6)    208 (120.8)      920
-------------------------------------------------------------
Total      911            1656           388             2955

e.g., first cell has fe = 615(911)/2955 = 189.6.
fe values are in parentheses in this table.

Chi-Squared Test Statistic

• Summarize closeness of {fo} and {fe} by

  $$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

  with the sum taken over all cells in the table.
• When H0 is true, the sampling distribution of this statistic is approximately (for large n) the chi-squared probability distribution.

Properties of chi-squared distribution

• On positive part of line only
• Skewed to right (more bell-shaped as df increases)
• Mean and standard deviation depend on size of table through df = (r – 1)(c – 1) = mean of distribution, where r = number of rows, c = number of columns
• Larger values are incompatible with H0, so P-value = right-tail probability above observed test statistic value.

Example: Happiness and family income

  $$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(272 - 189.6)^2}{189.6} + \cdots = 172.3$$

df = (3 – 1)(3 – 1) = 4. P-value = 0.000 (rounded, often reported as P < 0.001). Chi-squared percentile values for various right-tail probabilities are in the table on text p. 594.

There is very strong evidence against H0: independence (namely, if H0 were true, the probability would be < 0.001 of getting this large a test statistic or even larger). For significance level α = 0.05 (or α = 0.01 or α = 0.001), we reject H0 and conclude an association exists between happiness and income.

Software output (SPSS)

Comments about chi-squared test

• Using the chi-squared distribution to approximate the actual sampling distribution of the test statistic works well for “large” random samples.
  (Cochran (1954) showed it works OK in practice if all or nearly all fe ≥ 5.)
• For smaller samples, Fisher’s exact test applies (we skip).
• Most software also reports “likelihood-ratio chi-squared,” an alternative chi-squared test statistic.
• The chi-squared test treats variables as nominal scale (re-order categories, get the same result). For ordinal variables, more powerful tests are available (such as in Sections 8.5 and 8.6 of text), which we skip. We’ll use regression methods in Ch. 9.
  (Coming soon: Analysis of Ordinal Categorical Data, 2nd ed.)
• df = (r – 1)(c – 1) means that for given marginal counts, a block of (r – 1)(c – 1) cell counts determines the other counts. (Ronald Fisher 1922; Pearson, in 1900, said df = rc – 1)
• If z is a statistic that has a standard normal distribution, then z² has a
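
The happiness and family income example can be retraced numerically. The following is a minimal Python sketch, not part of the slides: it recomputes the sample conditional distributions, the expected frequencies fe = (row total)(column total)/n, the chi-squared statistic, and df from the observed counts. All variable and function names are illustrative. The right-tail probability uses a closed form that holds only when df is even (it is an assumption of this sketch that this is sufficient, which is the case here with df = 4); statistical software handles general df.

```python
from math import exp, factorial

# Observed counts (fo) from the 2006 GSS table in the slides.
# Rows: income (Above, Average, Below); columns: happiness (Very, Pretty, Not too).
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sample conditional distributions: percentages within each income row.
row_percents = [
    [100 * count / total for count in row]
    for row, total in zip(observed, row_totals)
]

# Expected frequencies under H0: fe = (row total)(column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over all cells of (fo - fe)^2 / fe.
chi2 = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

# Degrees of freedom: (r - 1)(c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_right_tail_even_df(x, df):
    """Right-tail probability of a chi-squared variable, valid for even df only.

    Uses the closed form exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))


p_value = chi2_right_tail_even_df(chi2, df)

print(f"fe for first cell = {expected[0][0]:.1f}")
print(f"chi-squared = {chi2:.1f}, df = {df}, P-value = {p_value:.3g}")
```

The P-value printed here is far below 0.001, which matches the slides' report of P-value = 0.000 (rounded).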

