UNL STAT 870 - 9.6 Model validation



9.6 Model validation

© 2012 Christopher R. Bilder

The final step in the model building process is validation of the selected regression model.

Why does KNN put this section in Chapter 9 if it is the “last step”? I am not sure! The section used to come after all of Chapter 10, which is probably the better place for it.

Methods:

1) Collect new data (preferred method)

a) Re-estimate the model parameters on the new data set.

Compare how close the parameter estimates are between the old and new data sets. If the parameter estimates are close, then the model is fine.

b) Measure the predictive capability of the model on the new data set.

Use the mean squared prediction error (MSPR):

$$MSPR = \frac{\sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2}{n^*}$$

where
$Y_i$ is the response variable from the new data set,
$\hat{Y}_i$ is the predicted $Y_i$ value (use the original model's $b_j$'s), and
$n^*$ is the number of observations in the new data set.

If MSPR is relatively close to MSE, then the model is fine. If MSPR is much larger than MSE, then MSE is a biased estimate of $\sigma^2$.

c) Examine the semi-studentized residuals for the new data set to check for outliers.

$$e_i = \frac{Y_i - \hat{Y}_i}{\sqrt{MSE}}$$

where
$Y_i$ is the response variable from the new data set,
$\hat{Y}_i$ is the predicted $Y_i$ value (use the original model's $b_j$'s), and
MSE is the mean square error calculated on the original data set.

This can help detect observations that are not being predicted adequately by the model.

2) Use a holdout sample (also called data splitting)

Since collecting new data is often not feasible, the original data set is often split into “model building” and “validation” data sets.

- Model building data set: Used to construct the model. Also referred to as a “training” data set in other statistical applications.
- Validation data set: Used to evaluate the model. Also referred to as a “test” data set in other statistical applications.

Notes:
1. The data set needs to be split at the beginning of the model building process.
2. Want the model building data set to be as large as possible so that good estimates of the parameters and their standard errors can be obtained.
3. Want the validation data set to be large enough so that enough data is available to evaluate the model.
4. KNN recommends having a number of observations at least 6 to 10 times the number of predictor variables in the variable pool for the model building data set.
5. Other recommendations that I have seen are using 80% in the model building and 20% in the validation.
6. The data set needs to be randomly split so that no biases are introduced into the model building process.
7. The validation set can be treated in the same way as a new data set (see 1).

Example: NBA guard data (NBA_ch10_validation.R)

Additional data was collected for the 1998-9 season. Note that it would be preferable to have data from the 1992-3 season. However, as is frequently the case with sports data, one would like to make statements about future seasons using a model based on past data, so this is a good data set to use for model validation.

> #1992-3 data
> nba<-read.table(file = "C:\\chris\\UNL\\STAT870\\Chapter6\\nba_data.txt", header=TRUE, sep = "")
> head(nba)
   last.name first.initial games    PPM     MPG height  FTP  FGP age
1 Abdul-Rauf            M.    80 0.5668 33.8750    185 93.5 45.0  24
2      Adams            M.    69 0.4086 36.2174    178 85.6 43.9  30
3      Ainge            D.    81 0.4419 26.7037    196 84.8 46.2  34
4   Anderson            K.    55 0.4624 36.5455    185 77.6 43.5  23
5    Anthony            G.    70 0.2719 24.2714    188 67.3 41.5  26
6 Armstrsong          B.J.    81 0.3998 30.7654    188 86.1 49.9  26

> #Model chosen on 1992-3 data
> mod.fit<-lm(formula = PPM ~ MPG + height + FGP + age + I(MPG^2) + MPG:age, data = nba)
> sum.fit<-summary(mod.fit)
> sum.fit

Call:
lm(formula = PPM ~ MPG + height + FGP + age + I(MPG^2) + MPG:age, data = nba)

Residuals:
      Min        1Q    Median        3Q       Max
-0.176248 -0.060386 -0.006655  0.059309  0.186663

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.654e-01  2.847e-01  -0.932 0.353570
MPG         -3.940e-02  7.657e-03  -5.146 1.37e-06 ***
height       4.821e-03  1.189e-03   4.056 0.000100 ***
FGP          1.096e-02  2.048e-03   5.350 5.76e-07 ***
age         -2.277e-02  6.629e-03  -3.436 0.000869 ***
I(MPG^2)     3.952e-04  9.821e-05   4.024 0.000113 ***
MPG:age      8.752e-04  2.838e-04   3.084 0.002651 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08447 on 98 degrees of freedom
Multiple R-Squared: 0.4993, Adjusted R-squared: 0.4687
F-statistic: 16.29 on 6 and 98 DF, p-value: 6.285e-13

> #1998-9 data
> nba.98.99<-read.table(file = "C:\\chris\\UNL\\STAT870\\Chapter10\\nba_98_99.txt", header=TRUE, sep = "")
> head(nba.98.99)
    last.name first.name  PPM  MPG height  FTP  FGP   age
1 Abdul-Wahad      Tariq 0.38 24.6    198 69.1 43.5 24.91
2    Anderson      Kenny 0.41 29.7    185 83.2 45.1 28.98
3      Askins      Keith 0.13 12.6    203 62.5 32.3 31.79
4       Barry        Jon 0.29 17.1    196 84.5 42.8 30.18
5        Beck      Corey 0.27  7.0    191 53.3 46.2 28.34
6       Bibby       Mike 0.38 35.2    188 75.1 43.0 21.38

> #Predictions for 1998-9 data using the 1992-3 data
> pred.98.99<-predict(object = mod.fit, newdata= nba.98.99)
> mspr<-sum((nba.98.99$PPM - pred.98.99)^2)/nrow(nba.98.99)
> mspr
[1] 0.01115384

> #Fit same model to the 1998-9 data
> mod.fit.98.99<-lm(formula = PPM ~ MPG + height + FGP + age + I(MPG^2) + MPG:age, data = nba.98.99)
> summary(mod.fit.98.99)

Call:
lm(formula = PPM ~ MPG + height + FGP + age + I(MPG^2) + MPG:age, data = nba.98.99)

Residuals:
     Min       1Q   Median       3Q      Max
-0.23021 -0.05994  0.01105  0.06283  0.20822

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5092041  0.4413453  -1.154 0.253979
MPG         -0.0273591  0.0094034  -2.909 0.005352 **
height       0.0030284  0.0019673   1.539 0.129889
FGP          0.0138088  0.0020565   6.715 1.54e-08 ***
age         -0.0022197  0.0064388  -0.345 0.731708
I(MPG^2)     0.0005564  0.0001382   4.027 0.000188 ***
MPG:age      0.0001017  0.0002661   0.382 0.703888
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09763 on 51 degrees of freedom
Multiple
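The holdout approach (method 2) can be sketched end to end as follows. This is only a minimal illustration on simulated data, because the NBA files are not reproduced here. The data frame `sim`, its variables `x1`, `x2`, and `y`, the seed, the 80/20 split size, and the |e| > 3 cutoff are all assumptions made for the example and are not part of the original notes.

```r
# Hypothetical simulated data; names and coefficients are illustrative only
set.seed(8712)
n <- 200
sim <- data.frame(x1 = runif(n, min = 10, max = 40),
                  x2 = rnorm(n, mean = 190, sd = 8))
sim$y <- 0.1 + 0.01 * sim$x1 + 0.002 * sim$x2 + rnorm(n, sd = 0.08)

# Randomly split BEFORE any model building: 80% model building, 20% validation
build.rows <- sample(x = 1:n, size = round(0.8 * n), replace = FALSE)
build <- sim[build.rows, ]
valid <- sim[-build.rows, ]

# Fit the model on the model building data set only
mod.fit <- lm(formula = y ~ x1 + x2, data = build)
mse <- summary(mod.fit)$sigma^2  # MSE from the model building data set

# MSPR: average squared prediction error on the validation data set
pred.valid <- predict(object = mod.fit, newdata = valid)
mspr <- sum((valid$y - pred.valid)^2) / nrow(valid)

# Semi-studentized residuals for the validation data set
# (the denominator uses the MSE from the model building data set)
semi.stud <- (valid$y - pred.valid) / sqrt(mse)

# Compare MSPR to MSE and flag possible outliers (assumed cutoff of 3)
data.frame(mse = mse, mspr = mspr, ratio = mspr / mse)
which(abs(semi.stud) > 3)

# Re-estimate the same model on the validation data set and
# compare the parameter estimates side by side
mod.fit.valid <- lm(formula = y ~ x1 + x2, data = valid)
round(cbind(build = coef(mod.fit), valid = coef(mod.fit.valid)), 4)
```

For the NBA example, the analogous comparison would set `sum.fit$sigma^2` (0.08447^2 ≈ 0.0071) against the reported `mspr` of about 0.0112.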

