
Lab 3: Classification EDA, KNN, and logistic regression
PSTAT131-231
Week 3

Objectives
Exploratory analysis for categorical data
  Working with factors in R
  Data wrangling
  Exploratory analysis
KNN and Logistic regression
  KNN
  Logistic regression

Objectives

• Explore visualization options for categorical data
• Fit KNN classifiers and use cross-validation for selection of k
• Fit logistic regression models and visualize model outputs

Exploratory analysis for categorical data

We’ll work with the gss_cat data from the forcats package to illustrate some possibilities for informative exploratory plots in the classification setting. These data comprise responses to a subset of questions on the General Social Survey in the years 2000 to 2014. We’ll look at three years of data.

    # subset to three survey years
    library(tidyverse)  # provides dplyr and forcats; assumed setup, not shown in the original
    library(pander)     # used below for table output; assumed setup
    gss <- filter(gss_cat, year %in% c(2006, 2010, 2014))

Working with factors in R

Aside from year (year of survey), age (age of respondent), and tvhours (hours spent watching TV), the remaining variables are factors: categorical variables with named levels. Check the structure of the partyid variable for political party affiliation:

    # structure of a factor
    str(gss$partyid)
    ## Factor w/ 10 levels "No answer","Don't know",..: 10 6 10 10 10 9 7 10 10 9 ...

    # names of levels
    levels(gss$partyid)
    ##  [1] "No answer"          "Don't know"         "Other party"
    ##  [4] "Strong republican"  "Not str republican" "Ind,near rep"
    ##  [7] "Independent"        "Ind,near dem"       "Not str democrat"
    ## [10] "Strong democrat"

The actual values are integers that code for the factor levels.
Coercing a factor to a numeric variable simply yields the codings:

    # coerce to double
    as.numeric(gss$partyid) %>% head()
    ## [1] 10  6 10 10 10  9

It’s easy to tabulate the data by factor levels using count:

    # tabulate party affiliations
    gss %>% count(partyid) %>% mutate(prop = n/nrow(gss))
    ## # A tibble: 10 x 3
    ##    partyid                n     prop
    ##    <fct>              <int>    <dbl>
    ##  1 No answer             67 0.00737
    ##  2 Don't know             1 0.000110
    ##  3 Other party          176 0.0194
    ##  4 Strong republican    924 0.102
    ##  5 Not str republican  1206 0.133
    ##  6 Ind,near rep         773 0.0850
    ##  7 Independent         1859 0.204
    ##  8 Ind,near dem        1129 0.124
    ##  9 Not str democrat    1490 0.164
    ## 10 Strong democrat     1467 0.161

Let’s imagine we’re interested in predicting party affiliation based on other variables. We can combine the factor levels into independent, democrat, republican, and other:

    old_party_levels <- levels(gss$partyid)
    gss <- gss %>%
      mutate(partyid = fct_collapse(partyid,
                                    'oth' = old_party_levels[1:3],
                                    'rep' = old_party_levels[4:5],
                                    'ind' = old_party_levels[6:8],
                                    'dem' = old_party_levels[9:10]))
    gss %>% count(partyid) %>% mutate(prop = n/nrow(gss)) %>% pander()

    partyid   n      prop
    oth       244    0.02684
    rep       2130   0.2343
    ind       3761   0.4137
    dem       2957   0.3252

Your turn. The relig variable has 16 levels for religious categories, 15 of which appear in these data. In order of representation in the data, these are:

    gss %>% count(relig) %>%
      arrange(desc(n))
    ## # A tibble: 15 x 2
    ##    relig                       n
    ##    <fct>                   <int>
    ##  1 Protestant               4426
    ##  2 Catholic                 2202
    ##  3 None                     1624
    ##  4 Christian                 326
    ##  5 Jewish                    155
    ##  6 Other                      86
    ##  7 Buddhism                   75
    ##  8 No answer                  53
    ##  9 Moslem/islam               37
    ## 10 Orthodox-christian         31
    ## 11 Inter-nondenominational    30
    ## 12 Hinduism                   28
    ## 13 Other eastern               8
    ## 14 Native american             7
    ## 15 Don't know                  4

Recode relig with one level for each of the five most represented groups and an ‘all other’ level.
(Hint: for more efficient code, try fct_lump_n.)

    # recode religion variable
    old_relig_levels <- levels(gss$relig)
    gss$relig %>% fct_lump_n(5) %>% table()
    ## .
    ##  Christian       None     Jewish   Catholic Protestant      Other
    ##        326       1624        155       2202       4426        359

Now check the reported income variable rincome. Notice that this factor has an inherent ordering, aside from some levels representing missing data of various types.

    # last level is 'out of order'
    levels(gss$rincome)
    ##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
    ##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
    ##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999"
    ## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

Let’s reorder the factor so that the ‘not applicable’ level appears together with ‘no answer’, ‘don’t know’, and ‘refused’.

    # reorder reported income
    gss <- gss %>%
      mutate(rincome = fct_relevel(rincome, 'Not applicable'))
    levels(gss$rincome)
    ##  [1] "Not applicable" "No answer"      "Don't know"     "Refused"
    ##  [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
    ##  [9] "$8000 to 9999"  "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"
    ## [13] "$4000 to 4999"  "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"

Now let’s group those categories together.

    # group non-ordered categories together
    adj_rincome_levels <- levels(gss$rincome)
    gss <- gss %>%
      mutate(rincome = fct_other(rincome, drop = adj_rincome_levels[1:4]))

Data wrangling

For exploratory analysis, we’ll consider political party affiliation as our variable of interest and investigate its relationship to marital status, age, race, reported income, and religion (after the recodings above).

    # select variables of interest
    gss <- gss %>% select(-tvhours, -denom)

As you saw above, there is a nonresponse category for reported income. There is also one for marital status. We’ll filter out those observations before going ahead, but first let’s check to see if there’s a pattern to nonresponse.
There aren’t that many observations for which the respondent gave ‘No answer’ to marital status.

    gss %>% count(marital)
    ## # A tibble: 6 x 2
    ##   marital           n
    ##   <fct>         <int>
    ## 1 No answer        11
    ## 2 Never married  2320
    ## 3 Separated       302
    ## 4 Divorced       1484
    ## 5 Widowed         756
    ## 6 Married        4219

But there are quite a few nonresponses for reported income.

    gss %>% count(rincome)
    ## # A tibble: 13 x 2
    ##    rincome            n
    ##    <fct>          <int>
    ##  1 $25000 or more  3173
    ##  2 $20000 - 24999   498
    ##  3 $15000 - 19999   396
    ##  4 $10000 - 14999   452
    ##  5 $8000 to 9999    125
    ##  6 $7000 to 7999     70
    ##  7 $6000 to 6999     94
    ##  8 $5000 to 5999     99
    ##  9 $4000 to 4999    101
    ## 10 $3000 to 3999    118
    ## 11 $1000 to 2999    154
    ## 12 Lt $1000         114
    ## 13 Other           3698

Your turn. Make a summary table that helps you investigate whether there’s any apparent pattern to nonresponse in the variable of interest. Are respondents with certain party affiliations not reporting their income on the survey more often than others? Avoid examining raw counts, as the numbers of respondents of each party affiliation are not equal. (Hint: consider finding the proportion of respondents of each party affiliation that didn’t report their income.)

    gss %>%
      count(rincome, partyid) %>%
      group_by(partyid) %>%
      mutate(prop = n/sum(n)) %>%
      pander()

    rincome          partyid   n    prop
    $25000 or more   oth       84
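
As a rough sketch of the hint in the exercise above, the within-group nonresponse proportion can also be computed in base R on a small made-up data set. The data frame toy and its column reported are hypothetical and are not part of gss_cat; they only illustrate why proportions within each party are easier to compare than raw counts when group sizes differ.

```r
# Hypothetical toy data (not from gss_cat): party affiliation and
# whether the respondent reported an income.
toy <- data.frame(
  partyid  = c("dem", "dem", "dem", "dem", "rep", "rep",
               "ind", "ind", "ind", "ind"),
  reported = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE,
               FALSE, FALSE, TRUE, TRUE)
)

# Raw counts of nonresponse depend on how many respondents each party has.
raw_counts <- tapply(!toy$reported, toy$partyid, sum)

# The mean of a logical vector is a proportion, so this gives the share
# of each party's respondents that did not report an income.
nonresponse <- tapply(!toy$reported, toy$partyid, mean)

raw_counts
nonresponse
```

In this toy example the ind group has twice as many nonresponses as the rep group by raw count, yet the two groups have the same nonresponse proportion, which is the comparison the exercise asks for.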



UCSB PSTAT 131 - Lab 3
