
Homework Assignment #5
36-350, Data Mining
Due at the start of lecture, 23 October 2009

1. The state.x77 data set is available by default in R; it’s a compilation of data about the US states put together from the 1977 Statistical Abstract of the United States, with the actual measurements mostly made a few years before. [1] The variables are:

   Population   in thousands
   Income       dollars per capita
   Illiteracy   percentage of adults unable to read and write
   Life Exp     average years of life expectancy at birth
   Murder       number of murders and non-negligent manslaughters per 100,000 people
   HS Grad      percentage of adults who were high-school graduates
   Frost        mean number of days per year with low temperatures below freezing
   Area         in square miles

   help(state.x77) has a little more detail. Also built in to R are state.center, giving the longitude and latitude of the geographic center of each state (except for Alaska and Hawaii, which are artificially put somewhere off the west coast), state.name for the names of the states, and state.abb for the names’ two-letter abbreviations.

   (a) Create a plot showing the location of each state, with longitude on the horizontal axis, latitude on the vertical axis, and the states’ names or abbreviations in the appropriate positions. Include your code.

   (b) Using the factanal command from R with the scores="regression" option, do a one-factor analysis of state.x77. Include the command you used and R’s output.

   (c) Describe the factor you obtained in the previous part in terms of the observable features.

   (d) Plot the states by location, with the size of each state’s label being a linearly increasing function of its factor score. You should control the minimum and maximum size of the labels. (Remember that many of the factor scores will be negative.) Include your code, and comment on the map it produces. Hint: The cex option to functions like text can be a vector.

       Alternatively, use the scatterplot3d command, from the package of that name, to make a three-dimensional plot, with the z axis being the factor score. If you do this, make sure to orient the plot so it is legible, and the states are clearly distinguished.

   (e) Part of the output of the factanal command is the p-value of the likelihood ratio test for comparing the fitted factor model to the unrestricted multivariate Gaussian. Plot this p-value against q, the number of factors. Include your code.

   (f) Is it plausible that there is really only one factor? Explain, and justify your answer in terms of R’s output, not your general knowledge of US geography.

   [1] The Statistical Abstract is “the best book published in America” (P. Krugman), an immensely valuable compilation of data about a huge range of aspects of American life, put out every year by the Census Bureau. It’s available for free online at http://www.census.gov/compendia/statab/.

2. Install (if you haven’t already) the packages ElemStatLearn and scatterplot3d from CRAN. The data set for this problem is zip.train in ElemStatLearn. This consists of scans of about 7000 hand-written numeric digits from zip codes on envelopes, scanned in as 16 × 16 grey-scale images. Each row of the data frame represents a different digit; the first column is the actual digit (as verified by a human being), and the other 256 columns are the grey-scale values of the different pixels (centered around zero). [2] The digits are the classes. Some parts of this problem may take excessively long to run if you use all rows of the data set; it’s OK to use just the first 500 rows, but if so, indicate that’s what you’re doing.

   (a) Do a PCA of zip.train, being sure to omit the first column. What command do you use? Why should you omit the first column?

   (b) Make plots of the projections of the data on to the first two and three principal components. (For the 3D plot, use the function scatterplot3d from that package.) Include the commands you used as well as the plots. On both plots, indicate which points come from which digits, and make sure that this is legible in what you turn in. (E.g., if you use colors, make sure they look distinct on your printout. You might try pch=as.character(zip.train[,1]).) Comment on the results.

   (c) Use the code from lecture to do an LLE with q = 3. Include the commands you used.

   (d) Make 2D and 3D plots of the data, as before, but with the LLE coordinates. Comment.

   (e) Run k-means with k = 10 on (i) the raw data, (ii) the 3D PCA projections and (iii) the 3D LLE. Calculate the variation-of-information distance of all three clusterings from the true classes (as given by the first column of zip.train). Comment.

   [2] You can visualize them with the function zip2image; see the example at the end of help(zip.train). This is not needed for the problem.

3. (Extra credit) Download the diffusionMap package from CRAN. Prepare a 3D scatterplot of the data, as in problem 2, using diffuse. Repeat the clustering from the end of problem 2 with the diffusionKmeans function, and calculate the distance of this clustering from the true classes. Comment on these
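
The variation-of-information distance asked for in problems 2(e) and 3 can be computed from the joint table of two labelings, using the standard identity VI = 2 H(A, B) - H(A) - H(B), where H is Shannon entropy. The following is a minimal sketch under that definition, not the code from lecture; the function names entropy and variation.of.information and the toy label vectors are illustrative assumptions.

```r
# Shannon entropy (in nats) of a probability vector or table.
entropy <- function(p) {
  p <- p[p > 0]                      # treat 0 * log(0) as 0
  -sum(p * log(p))
}

# Variation of information between two labelings of the same items.
# Hypothetical helper: VI = 2 H(a, b) - H(a) - H(b).
variation.of.information <- function(a, b) {
  stopifnot(length(a) == length(b))
  joint <- table(a, b)
  joint <- joint / sum(joint)        # joint distribution of label pairs
  2 * entropy(joint) - entropy(rowSums(joint)) - entropy(colSums(joint))
}

# Made-up labels, for illustration only.
truth     <- c(1, 1, 2, 2, 3, 3)
relabeled <- c("a", "a", "b", "b", "c", "c")   # same partition, new names
vi.same   <- variation.of.information(truth, relabeled)
```

Because the distance depends only on the partitions, not on the label names, vi.same is zero up to rounding. For the homework, the two arguments would presumably be the true digits and the cluster component of a kmeans fit.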

