FSU ARH 5361 - Homework - D1195685


Course Arh 5361- Northern Baroque Art (3).

Pages 3


Homework Assignment #5
36-350, Data Mining
Due at the start of lecture, 23 October 2009

1. The state.x77 data set is available by default in R; it’s a compilation of data about the US states put together from the 1977 Statistical Abstract of the United States, with the actual measurements mostly made a few years before. [1] The variables are:

   Population   in thousands
   Income       dollars per capita
   Illiteracy   percentage of adults unable to read and write
   Life Exp     average years of life expectancy at birth
   Murder       number of murders and non-negligent manslaughters per 100,000 people
   HS Grad      percentage of adults who were high-school graduates
   Frost        mean number of days per year with low temperatures below freezing
   Area         in square miles

   help(state.x77) has a little more detail. Also built in to R are state.center, giving the longitude and latitude of the geographic center of each state (except for Alaska and Hawaii, which are artificially put somewhere off the west coast), state.name for the names of the states, and state.abb for the names’ two-letter abbreviations.

   (a) Create a plot showing the location of each state, with longitude on the horizontal axis, latitude on the vertical axis, and the states’ names or abbreviations in the appropriate positions. Include your code.

   (b) Using the factanal command from R with the scores="regression" option, do a one-factor analysis of state.x77. Include the command you used and R’s output.

   (c) Describe the factor you obtained in the previous part in terms of the observable features.

   (d) Plot the states by location, with the size of each state’s label being a linearly increasing function of its factor score. You should control the minimum and maximum size of the labels. (Remember that many of the factor scores will be negative.) Include your code, and comment on the map it produces. Hint: The cex option to functions like text can be a vector.

       Alternately, use the scatterplot3d command, from the package of that name, to make a three-dimensional plot, with the z axis being the factor score. If you do this, make sure to orient the plot so it is legible, and the states are clearly distinguished.

   (e) Part of the output of the factanal command is the p-value of the likelihood ratio test for comparing the fitted factor model to the unrestricted multivariate Gaussian. Plot this p-value against q, the number of factors. Include your code.

   (f) Is it plausible that there is really only one factor? Explain, and justify your answer in terms of R’s output, not your general knowledge of US geography.

   [1] The Statistical Abstract is “the best book published in America” (P. Krugman), an immensely valuable compilation of data about a huge range of aspects of American life, put out every year by the Census Bureau. It’s available for free online at http://www.census.gov/compendia/statab/.

2. Install (if you haven’t already) the packages ElemStatLearn and scatterplot3d from CRAN. The data set for this problem is zip.train in ElemStatLearn. This consists of scans of about 7000 hand-written numeric digits from zip codes on envelopes, scanned in as 16 × 16 grey-scale images. Each row of the data frame represents a different digit; the first column is the actual digit (as verified by a human being), and the other 256 columns are the grey-scale values of the different pixels (centered around zero). [2] The digits are the classes. Some parts of this problem may take excessively long to run if you use all rows of the data set; it’s OK to use just the first 500 rows, but if so, indicate that’s what you’re doing.

   (a) Do a PCA of zip.train, being sure to omit the first column. What command do you use? Why should you omit the first column?

   (b) Make plots of the projections of the data on to the first two and three principal components. (For the 3D plot, use the function scatterplot3d from that package.) Include the commands you used as well as the plots. On both plots, indicate which points come from which digits, and make sure that this is legible in what you turn in. (E.g., if you use colors, make sure they look distinct on your printout. You might try pch=as.character(zip.train[,1]).) Comment on the results.

   (c) Use the code from lecture to do an LLE with q = 3. Include the commands you used.

   (d) Make 2D and 3D plots of the data, as before, but with the LLE coordinates. Comment.

   (e) Run k-means with k = 10 on (i) the raw data, (ii) the 3D PCA projections and (iii) the 3D LLE. Calculate the variation-of-information distance of all three clusterings from the true classes (as given by the first column of zip.train). Comment.

   [2] You can visualize them with the function zip2image; see the example at the end of help(zip.train). This is not needed for the problem.

3. (Extra credit) Download the diffusionMap package from CRAN. Prepare a 3D scatterplot of the data, as in problem 2, using diffuse. Repeat the clustering from the end of problem 2 with the diffusionKmeans function, and calculate the distance of this clustering from the true classes. Comment on these
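
Problem 2(e) asks for the variation-of-information distance between a clustering and the true classes but does not define it. The following is a minimal sketch in R, assuming the usual definition VI(A, B) = H(A) + H(B) - 2 I(A; B) with natural logarithms. The function name variation_of_information and the toy label vectors are illustrative and are not taken from the course code; the version used in lecture may differ, for example in the base of the logarithm.

```r
# Variation of information between two labelings of the same items.
# Assumed definition: VI(A, B) = H(A) + H(B) - 2 * I(A; B), in nats.
variation_of_information <- function(a, b) {
  stopifnot(length(a) == length(b))

  # Joint distribution over (label in a, label in b), from the contingency table
  joint <- table(a, b) / length(a)
  p_a <- rowSums(joint)   # marginal distribution of labels in a
  p_b <- colSums(joint)   # marginal distribution of labels in b

  # Shannon entropy, skipping empty cells so that 0 * log(0) counts as 0
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }

  h_a  <- entropy(p_a)
  h_b  <- entropy(p_b)
  h_ab <- entropy(as.vector(joint))

  mutual_info <- h_a + h_b - h_ab
  h_a + h_b - 2 * mutual_info
}

# Toy labels (made up for illustration, not from zip.train)
truth       <- c(1, 1, 2, 2)
relabelled  <- c("x", "x", "y", "y")  # same partition, different label names
independent <- c(1, 2, 1, 2)          # carries no information about truth

variation_of_information(truth, relabelled)
variation_of_information(truth, independent)
```

Under this definition the distance depends only on the partitions, not on the label names, so the arbitrary cluster numbers returned by kmeans need not be matched to the digits 0 to 9 first. For the assignment, the arguments would presumably be the cluster component of a kmeans fit and zip.train[, 1].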
