Model Selection and Multicollinearity
Bret Larget
Departments of Botany and of Statistics
University of Wisconsin-Madison
February 20, 2007
Statistics 572 (Spring 2007): Multiple Linear Regression

Outline
- Case Study: SAT Scores
- The Big Picture: Overview; Geometric Viewpoint of Regression
- Model Evaluation: R^2; Adjusted R^2
- Variable Selection: Maximum Likelihood; AIC; BIC; Computing
- Multicollinearity: Correlation; Higher Dimensions

Case Study: SAT Scores
- The data analysis illustrates model selection and multicollinearity.
- The data set is from 1982 and covers all fifty states.
- Variables:
  - sat: state average SAT score (verbal plus quantitative)
  - takers: percentage of eligible students that take the exam
  - income: median family income of test takers ($100)
  - years: average total high school courses in English, science, history, and mathematics
  - public: percentage of test takers attending public high school
  - expend: average state dollars spent per high school student ($100)
  - rank: median percentile rank of test takers

The Big Picture: Overview
- When there are many possible explanatory variables, several models are often nearly equally good at explaining variation in the response variable.
- R^2 and adjusted R^2 measure closeness of fit, but are poor criteria for variable selection.
- AIC and BIC are sometimes used as objective criteria for model selection.
- Stepwise regression searches for the best models, but does not always find them.
- Models selected by AIC or BIC are often overfit.
- Tests carried out after model selection are typically not valid.
- Parameter interpretation is complex.

The Geometric Viewpoint of Regression
- Consider a data set with n individuals, each with a response variable y, k explanatory variables x1, ..., xk, plus an intercept 1.
- This is an n x (k + 2) matrix.
- Each row is a point in (k + 1)-dimensional space (if we do not plot the intercept).
- We can also think of each column as a vector (a ray from the origin) in n-dimensional space.
- The explanatory variables plus the intercept define a (k + 1)-dimensional hyperplane in this space. (This is called the column space of X.)

Geometry (continued)
- The vector y = ŷ + r, where r is the residual vector.
- In least squares regression, the fitted value ŷ is the orthogonal projection of y into the column space of X.
- The residual vector r is orthogonal (perpendicular) to the column space of X.
- Two vectors are orthogonal if their dot product equals zero. The dot product of w = (w1, ..., wn) and z = (z1, ..., zn) is sum_{i=1}^n w_i z_i.
- r is orthogonal to every explanatory variable, including the intercept. This explains why the sum of the residuals is zero when there is an intercept.
- Understanding least squares regression as projection into a smaller space is helpful for developing intuition about linear models, degrees of freedom, and variable selection (see the R sketch below).
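The projection view is easy to check numerically. The following is a minimal R sketch using simulated data (the SAT data set itself is not reproduced in this preview, and the variable names x1, x2 are made up for illustration): it fits a least squares regression and verifies that the residual vector is orthogonal to every column of the design matrix, so the residuals sum to zero and ŷ is the orthogonal projection of y onto the column space of X.

```r
## Minimal sketch with simulated data: least squares as orthogonal projection.
## (Variable names here are illustrative, not from the SAT case study.)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
X    <- model.matrix(fit)   # n x (k + 1) design matrix: intercept, x1, x2
r    <- resid(fit)          # residual vector
yhat <- fitted(fit)         # projection of y onto the column space of X

## The dot product of r with each column of X is numerically zero,
## so r is orthogonal to the column space; in particular sum(r) = 0.
crossprod(X, r)

## y decomposes as yhat + r, and the two pieces are orthogonal.
all.equal(as.numeric(y), as.numeric(yhat + r))
sum(yhat * r)
```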
Model Evaluation: R^2
- The R^2 statistic is a generalization of the square of the correlation coefficient.
- R^2 can be interpreted as the proportion of the variance in y explained by the regression:
  R^2 = SSReg / SSTot = 1 - SSErr / SSTot.
- Every time a new explanatory variable is added to a model, R^2 increases.

Adjusted R^2
- Adjusted R^2 is an attempt to account for the number of explanatory variables:
  adj R^2 = 1 - MSErr / MSTot
          = 1 - [SSErr / (n - k - 1)] / [SSTot / (n - 1)]
          = 1 - [(n - 1) / (n - k - 1)] (SSErr / SSTot)
          = 1 - [(n - 1) / (n - k - 1)] (1 - R^2).
- The model with the best adjusted R^2 has the smallest σ̂².

Variable Selection: Maximum Likelihood
- The probability of the observable data is represented by a mathematical expression relating parameters and data values.
- For fixed parameter values, the total probability is one.
- The likelihood is the same expression for the probability of the observed data, but considered as a function of the parameters with the data held fixed.
- The principle of maximum likelihood is to estimate parameters by making the likelihood (the probability of the observed data) as large as possible.
- In regression, the least squares estimates β̂_i are also the maximum likelihood estimates.
- The likelihood is typically only defined up to a constant.

AIC
- Akaike's Information Criterion (AIC) is based on the maximized likelihood and a penalty for each parameter.
- The general form is
  AIC = -2 log L + 2p,
  where L is the likelihood and p is the number of parameters.
- In multiple regression, this becomes
  AIC = n log(RSS / n) + 2p + C,
  where RSS is the residual sum of squares and C is a constant.
- In R, the functions AIC and extractAIC define the constant differently. We only care about differences in AIC, so this does not matter (so long as we consistently use one or the other).
- The best model by this criterion minimizes AIC.

BIC
- Schwarz's Bayesian Information Criterion (BIC) is similar to AIC but penalizes additional parameters more heavily.
- The general form is
  BIC = -2 log L + (log n) p,
  where n is the number of observations, L is the likelihood, and p is the number of parameters.
- In multiple regression, this becomes
  BIC = n log(RSS / n) + (log n) p + C,
  where RSS is the residual sum of squares and C is a constant.
- In R, the functions AIC and extractAIC also compute BIC when given the extra argument k = log(n), where n is the number of observations (a short R sketch comparing these criteria follows below).
- The best model by this criterion minimizes BIC.
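As a concrete illustration of these criteria, here is a minimal R sketch, again using simulated data with made-up variable names since no data from the case study appear in this preview. It computes R^2 and adjusted R^2 from a fitted model, shows that AIC and extractAIC differ only by a constant that cancels when comparing models, and obtains BIC by passing k = log(n).

```r
## Minimal sketch (simulated data; names are illustrative only) comparing
## the model evaluation criteria discussed above.
set.seed(2)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x2, x3 are pure noise
y  <- 1 + 2 * x1 + rnorm(n)

small <- lm(y ~ x1)
big   <- lm(y ~ x1 + x2 + x3)

## R^2 always increases when variables are added; adjusted R^2 need not.
summary(small)$r.squared;     summary(big)$r.squared
summary(small)$adj.r.squared; summary(big)$adj.r.squared

## AIC() and extractAIC() use different constants, but the *difference*
## between two models is the same either way.
AIC(big) - AIC(small)
extractAIC(big)[2] - extractAIC(small)[2]

## BIC: extractAIC with k = log(n) penalizes extra parameters more.
extractAIC(small, k = log(n))
extractAIC(big,   k = log(n))
```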
Variable Selection: Computing (Stepwise Regression)
- If there are p explanatory variables, we can in principle compute AIC (or BIC) for every possible combination of variables; there are 2^p such models.
- Instead, we typically begin with a model and repeatedly add or remove the single variable that decreases AIC the most, continuing until no single-variable change makes an improvement.
- This process need not find the globally best model.
- It is wise to begin searches from models with both few and many variables to see if they finish in the same place, as in the sketch below.
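The sketch below illustrates this advice with R's step() function on a simulated data frame dat (a stand-in for the SAT data, which are not reproduced here): search upward from the intercept-only model and downward from the full model, then compare where the two searches finish. Passing k = log(n) to step() selects by BIC instead of AIC.

```r
## Minimal stepwise-search sketch. 'dat' is a simulated stand-in data frame
## with a response y and several candidate explanatory variables.
set.seed(3)
n   <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x3 + rnorm(n)

null_fit <- lm(y ~ 1, data = dat)   # model with few variables
full_fit <- lm(y ~ ., data = dat)   # model with many variables

## Search up from the small model and down from the large one (AIC by default).
up   <- step(null_fit,
             scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
             direction = "both", trace = FALSE)
down <- step(full_fit, direction = "both", trace = FALSE)

## Check whether the two searches finish in the same place.
formula(up)
formula(down)

## Use k = log(n) to select by BIC, which penalizes extra variables more.
step(full_fit, direction = "both", k = log(n), trace = FALSE)
```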

