
Midterm Examination
36-350, Data Mining
14 October 2009

No notes or calculators are allowed. All calculations can be done by hand, possibly (but not necessarily) using the facts on this page. SHOW YOUR WORK: partial credit will be based on work; correct answers without work will receive minimal or no credit. If you suspect you have made a mistake but cannot find it, say so, and say why you think there is an error.

    Problem    1    2    3    4
    Points    15   35   25   25

Possibly Helpful Facts (here $x$ is an $m \times 1$ matrix, and $A$ and $B$ are $m \times m$):

    $\frac{d\, x^T x}{dx} = 2x$
    $\frac{d\, x^T A x}{dx} = Ax + A^T x$
    $Ax = vx \Leftrightarrow x$ is an eigenvector of $A$ with eigenvalue $v$
    $(AB)^T = B^T A^T$

1. (15 points in all) Briefly define the following terms (2 pt each). Formulas are OK, but explain what the symbols in them mean.

    (a) Ward's method of clustering
    (b) Entropy
    (c) Inverse document frequency
    (d) Cross-validation
    (e) Nearest-neighbor classifier
    (f) Dendrogram
    (g) Confusion matrix

2. Finding reviewers (35 pts total). Scientific papers submitted to a journal or conference are peer reviewed, meaning that they are evaluated by other scientists familiar with work in the area. Journal editors and conference organizers spend a lot of time selecting reviewers, and authors worry about getting good referees. Suppose that a journal has a database of the full text of all papers previously published in the journal, along with their authors.

    (a) (3 pts) Explain what the bag-of-words representation for an individual paper would be.
    (b) (2 pts) Explain how to combine the representations of all papers by a given author to get a bag of words for that author.
    (c) (15 pts) Describe an algorithm for finding the three authors whose work is most relevant to a given paper and are not authors of the paper. You do not have to write code, but be clear about what needs to be done.
    (d) (5 pts) How could you use principal components analysis of bags of words to simplify and improve this system?
    (e) (5 pts) Describe how to use the bags of words to hierarchically cluster authors.
    (f) (5 pts) Describe another algorithm for finding peer reviewers of a paper, using the hierarchical clustering of authors.

3. (25 points in all) state.x77 is a data set about the United States in 1977, using figures taken from the Census's Statistical Abstract. (You will see this again in the homework.) The variables are:

    Population   In thousands
    Income       Dollars per capita
    Illiteracy   Percent of the adult population unable to read and write
    Life Exp     Average years of life expectancy at birth
    Murder       Number of murders and non-negligent manslaughters per 100,000 people
    HS Grad      Percent of adults who were high school graduates
    Frost        Mean number of days per year with low temperatures below freezing
    Area         In square miles

The summary statistics for these variables will be helpful:

    summary(state.x77)

                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    Population       365      1080      2838      4246      4968     21198
    Income          3098      3993      4519      4436      4814      6315
    Illiteracy     0.500     0.625     0.950     1.170     1.575     2.800
    Life Exp       67.96     70.12     70.67     70.88     71.89     73.60
    Murder         1.400     4.350     6.850     7.378    10.675    15.100
    HS Grad        37.80     48.05     53.25     53.11     59.15     67.30
    Frost           0.00     66.25    114.50    104.46    139.75    188.00
    Area            1049     36985     54277     70736     81162    566432

We will do two different principal component analyses of this data:

    states.pca.1 = prcomp(state.x77, scale.=FALSE)
    states.pca.2 = prcomp(state.x77, scale.=TRUE)

The figures following show some displays for these two PCAs, which you will need to use to answer the questions.

[Figure 1: Biplot for states.pca.1. Axes: PC1 and PC2; states are labeled by name and the eight variables are drawn as arrows.]

                  PC1             PC2
    Population    1.18 × 10^-3    1.00
    Income        2.62 × 10^-3    2.80 × 10^-2
    Illiteracy    5.52 × 10^-7    1.42 × 10^-5
    Life Exp      1.69 × 10^-6    1.93 × 10^-5
    Murder        9.88 × 10^-6    2.79 × 10^-4
    HS Grad       3.16 × 10^-5    1.88 × 10^-4
    Frost         3.61 × 10^-5    3.87 × 10^-3
    Area          1.00            1.26 × 10^-3

Table 1: Projections of the features on to the first two principal components of states.pca.1. (Magnitudes only; any minus signs on the entries did not survive in this preview.)

[Figure 2: Projections of the states on to the first two principal components of states.pca.1. Axes: PC1 and PC2; states are labeled by two-letter abbreviation.]

[Figure 3: Scree plot for states.pca.1. Vertical axis: Variances; horizontal axis: components 1 to 6.]

[Figure 4: Biplot for states.pca.2. Axes: PC1 and PC2; states are labeled by name and the eight variables are drawn as arrows.]

                  PC1       PC2
    Population    0.1260    0.4110
    Income        0.2990    0.5190
    Illiteracy    0.4680    0.0530
    Life Exp      0.4120    0.0817
    Murder        0.4440    0.3070
    HS Grad       0.4250    0.2990
    Frost         0.3570    0.1540
    Area          0.0334    0.5880

Table 2: Projections of the features on to … (Magnitudes only; any minus signs on the entries did not survive in this preview.)
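The two analyses in Problem 3 can be re-run in R, since state.x77 ships with the base "datasets" package. The sketch below is only a plausible reconstruction: the exam does not show the plotting commands behind its figures, so the calls to biplot, screeplot, plot and text are assumptions, and the names loadings.1 and loadings.2 are introduced here purely for illustration. The signs of principal components are arbitrary, so signs may differ between R versions or platforms.

```r
# state.x77 and state.abb are part of base R's "datasets" package.
states.pca.1 <- prcomp(state.x77, scale. = FALSE)  # raw variables
states.pca.2 <- prcomp(state.x77, scale. = TRUE)   # variables scaled to unit variance

# Projections of the eight features on to the first two components
# (hypothetical names; the counterparts of Tables 1 and 2).
loadings.1 <- states.pca.1$rotation[, 1:2]
loadings.2 <- states.pca.2$rotation[, 1:2]
print(signif(loadings.1, 3))
print(signif(loadings.2, 3))

# Assumed equivalents of the exam's displays.
biplot(states.pca.1)                              # cf. Figure 1
plot(states.pca.1$x[, 1:2], type = "n")           # empty frame for labels
text(states.pca.1$x[, 1:2], labels = state.abb)   # cf. Figure 2
screeplot(states.pca.1)                           # cf. Figure 3
biplot(states.pca.2)                              # cf. Figure 4
```

Comparing the two loading matrices side by side shows how strongly the choice of scaling changes which variables dominate the leading components, which is the contrast the figures and tables are meant to expose.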



CMU STA 36350 - midterm
