Exploratory data analysis with two qualitative variablesExploratory data analysis with two qualitative/categorical variablesMotivating exampleTitanic: Class by survivalTitanic: Marginal frequenciesTitanic: Conditional frequenceisTitanic: Class by person typeTitanic: percentage of men in each classSurviving the TitanicRelative risk and odds ratiosPhysicians’ health studyRelative riskRelative risk for physicians’ health studyOdds ratiosOdds ratios for physicians’ health studyInterpreting odds ratios and relative risksWhich one should be used?Odds ratio vs relative riskRelative risk vs absolute riskSimpsons paradoxNot in FPPExploratory data analysis with two qualitative variables1Exploratory data analysis with two qualitative/categorical variablesMain toolsContigency tablesConditional, marginal, and joint frequencies2Motivating exampleSurviving the TitanicWas there a class discrimination in survival of the wreck of the Titanic?“It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912 3Titanic: Class by survival11stst ClassClass22ndnd ClassClass33rdrd ClassClassCrewCrewDeadDead12212216716752852869669615131513AliveAlive203203118118178178212212711711325325285285706706908908222422244Titanic: Marginal frequencies% Dead = 1513/2224 = 0.68% Alive = 711/2224 = 0.32% in first class = 325/2224 = 0.14% in second class = 285/2224 = 0.13% in third class = 706/2224 = 0.32% crew = 908/2224 = 0.41 5Titanic: Conditional frequenceis% (Alive | 1st) = 203/325 = 0.625% (Alive | 2nd) = 118/285 = 0.414% (Alive | 3rd) = 178/706 = 0.252% (Alive | Crew) = 212/908 = 0.233Based on these frequencies does there appear to be class discrimination?6Titanic: Class by person type1st Class2nd Class3rd ClassCrewChild. 6 24 79 0 109Wom. 144 93 165 23 425Men 175 168 462 885 1690325 285 706 908 22247Titanic: percentage of men in each class% (Man | 1st) = 175/325 = 0.54% (Man | 2nd) = 168/285 = 0.59% (Man | 3rd) = 462/706 = 0.65% (Man | Crew) = 885/908 = 0.97There are larger percentages of men in third class and crew8Surviving the TitanicA reason for class differences in survival:Larger percentages of men died3rd class consisted of mostly men.Hence, a larger percentage of 3rd class passengers died.Once again keep in mind possible lurking variables that could be driving the relationship seen between two measured variables9Relative risk and odds ratiosMotivating examplePhysicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years oldHalf the subjects assigned to take aspirin every other dayOther half assigned to take a placebo, a dummy pill that looked and tasted like aspirin10Physicians’ health studyHere are the number of people in each cell:11Relative risky1 y2x1 a ba+bx2 c dc+da+cb+dRisk of y1 for level x1=a/(a+b)Risk of y1 for level x2=c/(c+d)€ Relative risk =a/(a +b)c /(c + d)12Relative risk for physicians’ health studyRelative risk of a heart attack when taking aspirin versus when taking a placebo equalsPeople that took aspirin are 0.55 times as likely to have a heart attack than people that took the placeboOr people that took placebo are 1/0.55 = 1.82 times as likely to have a heart attack than people that took aspirin€ RR =104 /(104 + 10933)189 /(189 +10845)= 0.5513Odds ratiosy1 y2x1 a bx2 c dOdds of y1 for level x1=a/bOdds of y1 for level x2=c/d€ Odds ratio =a/bc /d14Odds ratios for physicians’ health studyRelative risk of a heart attack when taking aspirin versus taking a placebo isOdds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio)€ RR =104 /(104 + 10933)189 /(189 +10845)= 0.55€ OR =104 /10933189 /10845= 0.54615Interpreting odds ratios and relative risksWhen the variables X and Y are independentodds ratio = 1 relative risk = 1When subjects with level x1 are more likely to have y1 than subjects with level x2, theodds ratio > 1 relative risk > 1When subjects with level x1 are less likely to have y1 than subjects with level x2, thenodds ratio < 1 relative risk < 116Which one should be used?If Relative Risk is available then it should be usedIn a cohort study, the relative risk can be calculated directlyIn a case-control study the relative risk cannot be calculated directly, so an odds ratio is used insteadCase-control studies is an example. They compare subjects who have a “condition” to subjects that don’t but have similar controlsIn this type of study we know %(exposure|disease). But to compute the RR we need %(disease|exposure).Recall that RR = %(disease|exposure)/%(disease|placebo)Not available in more complex modeling (logistic regression)17Odds ratio vs relative riskWhen is odds ratio a good approximation of relative riskWhen cases are representative of diseased populationWhen controls are representative of population without diseaseWhen the disease being studied occurs at low frequencyOf itself, an odds ratio is a useful measure of association18Relative risk vs absolute risk% smokers who get lung cancer: 8% (conservative guess here)Relative risk of lung cancer for smokers: 800%Getting lung cancer is not commonplace, even for smokers. But, smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.19Simpsons paradoxWhen a third variable seemingly reverses the association between two other variables Hot hand
View Full Document