Displaying and Describing Categorical Data
Relationships between categorical variables

- Summarizing categorical data
- Two-way tables
- Marginal distributions
- Conditional distributions
- Simpson's paradox

Summary of Categories

- Count: each category has a number of occurrences (frequency tables)
- Percentages are useful (relative frequency tables)

Sex       Count   Percentage
Male        157        46.4
Female      181        53.6
Total       338       100

Two Categories

- What is the relationship between two categories?
- A cross-classification table is a good summary (contingency tables or two-way tables)

                Freshman   Sophomore   Junior   Senior   Row Totals
Male                  43          67       30       17          157
Female                48          62       25       46          181
Column Totals         91         129       55       63          338

Column Percentages

         F            Sp            J            Sr           Total
M        43 (47%)     67 (52%)      30 (55%)     17 (27%)     157 (46%)
F        48 (53%)     62 (48%)      25 (45%)     46 (73%)     181 (54%)
Total    91 (100%)    129 (100%)    55 (100%)    63 (100%)    338

Row Percentages

         F            Sp            J            Sr           Total
M        43 (27%)     67 (43%)      30 (19%)     17 (11%)     157 (100%)
F        48 (27%)     62 (34%)      25 (14%)     46 (25%)     181 (100%)
Total    91 (27%)     129 (38%)     55 (16%)     63 (19%)     338

Which way is better?

- There is a basic asymmetry to many problems
- Explanatory variable: predictor, cause, available variable
- Response variable: predicted, effect, interesting variable

Visualize Categorical Data

- Give a clear picture of what the data contain
- Emphasize differences or similarities
- Bar graphs and pie charts are usually best
- Many varieties; the actual form of the graph depends on the use
- Height of bar or size of pie slice shows the frequency or percentage for each category (area principle)
- What is the graph communicating?

Bar Graph

[Figure: bar graph of counts by sex (Male 157, Female 181)]

Year in School

[Figure: bar graph by year in school (Freshman, Sophomore, Junior, Senior)]

For 2 variables, use multiple columns

[Figure: bar graph grouped by year in school, with separate Male and Female bars]

Or the other way around

[Figure: bar graph grouped by sex, with separate bars for Freshman, Sophomore, Junior, Senior]

My Pet Peeves

Graphs should:
- Be clear
- Allow comparisons
- Tell a story

Graphs should not:
- Have uninformative aspects
- Obscure

Arrrgh!

[Figures: bar graphs of the Male/Female data, shown as examples of what not to do]
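Returning to the sex-by-year two-way table: the row and column percentages can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original slides; the names `table`, `row_pct`, and `col_pct` are illustrative assumptions, and percentages are rounded to the nearest whole percent as in the tables.

```python
# Counts from the two-way table (sex by year in school).
years = ["Freshman", "Sophomore", "Junior", "Senior"]
table = {
    "Male": [43, 67, 30, 17],
    "Female": [48, 62, 25, 46],
}

# Marginal totals: one per row (sex) and one per column (year).
row_totals = {sex: sum(counts) for sex, counts in table.items()}
col_totals = [sum(table[sex][j] for sex in table) for j in range(len(years))]
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# Row percentages: each cell divided by its row total.
row_pct = {
    sex: [pct(c, row_totals[sex]) for c in counts]
    for sex, counts in table.items()
}

# Column percentages: each cell divided by its column total.
col_pct = {
    sex: [pct(c, col_totals[j]) for j, c in enumerate(counts)]
    for sex, counts in table.items()
}

for sex in table:
    print(sex, "row %:", row_pct[sex], "column %:", col_pct[sex])
```

Row percentages answer "of the males, what share are freshmen?", while column percentages answer "of the freshmen, what share are male?". Which one to report depends on which variable is treated as explanatory.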
Wouldn't it just be simpler?

[Figure: bar graph grouped by year in school, with separate Male and Female bars]

USA Today Snapshots

Pie Charts

[Figure: pie chart of year in school (Freshman, Sophomore, Junior, Senior)]

Or

[Figure: horizontal bar chart of year in school, scale 0 to 100]

Allows clear comparisons

[Figure: chart of year in school shown separately for Female and Male, scale 0 to 100]

What is wrong with this picture?

Constructing Bar and Pie Charts

1. Define categories for variables of interest.
2. Determine the appropriate measure for each category. For pie charts, the value assigned is the proportion of the total for all categories.
3. Develop the chart. For pie charts, the size of the slice is proportional to the value, and the sum must equal 100%.

More examples

Investor's Portfolio

[Figure: horizontal bar chart by investment type (CD, Stocks, ...); axis: Amount in 1000's, 0 to 50]

Bar charts can also be displayed with vertical bars.

Newspaper readership per week

Number of days read   Frequency
0                            44
1                            24
2                            18
3                            16
4                            20
5                            22
6                            26
7                            30
Total                       200

[Figure: vertical bar chart; x-axis: Number of days newspaper is read per week; y-axis: Frequency]

Pie Chart Example

Current Investment Portfolio

Investment Type   Amount (in thousands)   Percentage
Stocks                             46.5        42.27
Bonds                              32.0        29.09
CD                                 15.5        14.09
Savings                            16.0        14.55
Total                             110         100

[Figure: pie chart with slices Stocks 42%, Bonds 29%, Savings 15%, CD 14%]

- Qualitative variables
- Must equal 100%
- Percentages are rounded to the nearest percent

Marginal distributions

We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percentages. (They are written as if in a margin.)

[Table: 2000 U.S. census]

The marginal distributions can then be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, completely ignoring the second one.

Conditional distribution

Music and wine purchase decision

What is the relationship between type of music played in supermarkets and type of wine purchased?

We want to compare the conditional distributions of the response variable (wine purchased) for each value of the explanatory variable (music played). Therefore, we calculate column percents.

Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine. 30/84 = 0.357, so 35.7% of the wine sold was French when no music was played.

30 / 84 = 35.7% (cell total / column total)

We calculate the column conditional percents similarly for each of the nine cells in the table.

For every two-way table, there are two sets of possible conditional distributions.

Does background music in supermarkets influence customer purchasing decisions?

- Wine purchased for each kind of music played (column percents)
- Music played for each kind of wine purchased (row percents)

Simpson's paradox

An association or comparison that holds for all of several groups can reverse direction when the data are combined (aggregated) to form a single group. This reversal is called Simpson's paradox.

Example: Hospital death rates

             Hospital A   Hospital B
Died                 63           16
Survived           2037          784
Total              2100          800
% surv.            97.0         98.0

On the surface, Hospital B would seem to have a better record.

But once patient condition is taken into account, we see that Hospital A has in fact a better record for both patient conditions (good and poor).

Patients in good condition

             Hospital A   Hospital B
Died                  6            8
Survived            594          592
Total               600          600
% surv.            99.0         98.7

Patients in poor condition

             Hospital A   Hospital B
Died                 57            8
Survived           1443          192
Total              1500          200
% surv.            96.2         96.0

Here, patient condition was the lurking variable.
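The reversal in the hospital example can be checked numerically. The sketch below is an illustration in Python, not part of the original slides. It assumes the counts are assigned to the hospitals in the only way that matches the stated conclusion (Hospital B better overall, Hospital A better within each patient condition); the names `counts`, `survival_pct`, and `overall_survival_pct` are hypothetical.

```python
# (died, total) for each hospital, split by patient condition.
# Assumption: this assignment of counts to hospitals is the one consistent
# with the conclusion stated in the text.
counts = {
    "Hospital A": {"good": (6, 600), "poor": (57, 1500)},
    "Hospital B": {"good": (8, 600), "poor": (8, 200)},
}


def survival_pct(died, total):
    """Percentage of patients who survived."""
    return 100.0 * (total - died) / total


def overall_survival_pct(by_condition):
    """Survival percentage after combining all patient conditions."""
    died = sum(d for d, _ in by_condition.values())
    total = sum(t for _, t in by_condition.values())
    return survival_pct(died, total)


for hospital, by_condition in counts.items():
    parts = ", ".join(
        f"{condition}: {survival_pct(d, t):.1f}%"
        for condition, (d, t) in by_condition.items()
    )
    print(f"{hospital}: overall {overall_survival_pct(by_condition):.1f}% ({parts})")
```

Hospital A comes out ahead within each condition but behind overall, because most of its patients (1500 of 2100) arrive in poor condition, where survival is lower at both hospitals. The aggregated comparison mostly reflects that difference in patient mix.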