STOR 664: FALL 2020
Final Exam, November 20, 2020

This is an open book, remote learning exam. Time limit: 6 hours. Access to course materials and standard computational tools (in particular, R) is allowed; communication with other students, or with anybody via the internet other than the instructor, is not. The university Honor Code is in effect at all times.

Answers may be given in any of the following formats, including combinations of more than one: R Markdown, Word, LaTeX, or handwritten pages scanned or photographed (if handwritten, it is recommended you use blue or black ink on plain sheets of white paper). They should then be uploaded in Gradescope.

The exam is worth 100 points total: 30 each for Questions 1 and 2, 40 for Question 3; points for each part-question are stated below. Although the questions are intended to be answered in sequence, you may write out your answers in any order, and errors in one part-question will not prevent you gaining full credit in other parts of the same question. Attempt all questions.

Obs    X1      X2    X3    X4    X5      Y
  1   58.8    7107   21   129    52   3067
  2   65.2    6373   22   141    68   2828
  3   70.9    6796   22   153    29   2891
  4   77.4    9208   20   166    23   2994
  5   79.3   14792   25   193    40   3082
  6   81.0   14564   23   189    14   3898
  7   71.9   11964   20   175    96   3502
  8   63.9   13526   23   186    94   3060
  9   54.5   12656   20   190    54   3211
 10   39.5   14119   20   187    37   3286
 11   44.5   16691   22   195    42   3542
 12   43.6   14571   19   206    22   3125
 13   56.0   13619   22   198    28   3022
 14   64.7   14575   22   192     7   2922
 15   73.0   14556   21   191    42   3950
 16   78.9   18573   21   200    33   4488
 17   79.4   15618   22   200    92   3295

Table 1: Water usage data set

X1   Average monthly temperature (°F)
X2   Average production (thousands of pounds)
X3   Number of operating plant days in the month
X4   Number of persons on the monthly plant payroll
X5   Two-digit random number
Y    Monthly water usage (gallons)

Table 2: Variables used in Table 1

1. A production plant cost-control engineer is responsible for cost reduction. One of the costly items in her plant is the amount of water used by the production facilities each month. She decides to investigate water usage by
collecting 17 observations of the plant's water usage and other variables. She had heard about multiple regression, but since she was quite skeptical, she added a column of random numbers to the original observations. Then she asked her team's statistician to analyze the data and make recommendations about the control variables X1, ..., X5.

The complete set of data is shown in Table 1. The variables are described in Table 2.

You are the team's statistician. Analyze the data with appropriate consideration for variable selection, diagnostics, transformations and multicollinearity. The objective is to find the best model for predicting Y as a function of X1, ..., X5. Write a report summarizing the results of this exercise, discussing as many as possible of the following items. In particular, your report should cover:

(a) The best regression model with variables selected from X1, ..., X5. [7 points]
(b) The goodness of fit of the model with all five variables included, taking into account the various diagnostics available within R. [8 points]
(c) The interpretation of statistics regarding outliers, points of high leverage, influence diagnostics and multicollinearity. [8 points]
(d) The final conclusions that the engineer should draw. In particular, state which variables are predictors of water usage, whether the associations are positive or negative, and the degree of confidence you have in each of your conclusions. [7 points]

2. A dataset was collected on median housing prices in 506 neighborhoods of a large city, together with a number of potential explanatory variables. The dataset contains the following variables:

• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq. ft.
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitrogen oxides concentration (parts per 10 million)
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted mean of distances to five Boston employment centres
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per $10,000
• ptratio: pupil-teacher ratio by town
• black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
• lstat: lower status of the population (percent)
• medv: median value of owner-occupied homes in $1000s

The object of the exercise is to predict medv as a function of all the other variables.

For the purpose of the present exercise, the dataset has been split randomly into two parts: housing train csv, which is a training dataset of 400 neighborhoods, and housing test csv, which is a test dataset of 106 neighborhoods. Both are csv files and may be read into R with the read.csv command.

(a) Based on the training dataset, find the best model by variable selection and use cross-validation to evaluate its performance (mean square prediction error, or MSPE). Then validate your conclusion by running it on the test dataset and calculating the MSPE for that. [6 points]
(b) Repeat the exercise of (a) using lasso regression. [6 points]
(c) Repeat the exercise of (a) using ridge regression. [6 points]
(d) Repeat the exercise of (a) using elastic net regression with α = 0.5. [6 points]
(e) Compare and contrast the four methods for finding a predictor of medv. [6 points]

3. An experiment was performed to compare two treatments that are used in spinning cotton thread. One is Flyer (1 or 2), which represents two different machines for spinning the fiber. The other is Twist (1, 2, 3 or 4), which indicates four levels of twisting the fiber. The original intention was to use all eight possible combinations of Flyer and Twist, but it was found that the combinations (Flyer 1, Twist 1) and (Flyer 2, Twist 4) were unstable, so these were not used. The experiment was laid out in 13 blocks, as in Table 3.

Flyer Twist B1 B2 B3 B4 11 5 8 8 10 6 6 9 3 3 4 1 9 7 8 3 3 3 6 4 6 4 4 6 7 4 7 9 7 3 4 1 8 3 5 0 6 6 4 2 3 3 3 3 7 4 2 2 3 4 1 2 3 1 1 1 2 2 2 B5 17 9 10 1 7 9 6 0 7 8
5 5 B6 11 9 11 5 5 5 7 4 5 9 3 2 Block B7 B8 B9 B10 B11 B12 B13 8 7 10 2 12 4 8 7 12 0 7 8 6 0 7 …