Contents: Regression Model Building; Backward Elimination; Forward Selection; Stepwise Regression; All Possible Regressions - Cp; Regression Diagnostics; Detecting Influential Observations; Obtaining Influence Statistics and Studentized Residuals in SPSS; Variance Inflation Factors; Nonlinearity: Polynomial Regression; Generalized Linear Models (GLM); Random Component; Common Link Functions; Exponential Regression Models

Regression Model Building
• Setting: Possibly a large set of predictor variables (including interactions).
• Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors.
• Automated procedures and all possible regressions:
  – Backward Elimination (top-down approach)
  – Forward Selection (bottom-up approach)
  – Stepwise Regression (combines forward/backward)
  – Cp statistic: summarizes each possible model, so that the "best" model can be selected based on the statistic

Backward Elimination
• Select a significance level to stay in the model (e.g. SLS = 0.20; generally 0.05 is too low, causing too many variables to be removed).
• Fit the full model with all possible predictors.
• Consider the predictor with the lowest t-statistic (highest P-value).
  – If P > SLS, remove the predictor and fit the model without this variable (the model must be re-fit here because the partial regression coefficients change).
  – If P ≤ SLS, stop and keep the current model.
• Continue until all remaining predictors have P-values below SLS.

Forward Selection
• Choose a significance level to enter the model (e.g. SLE = 0.20; generally 0.05 is too low, causing too few variables to be entered).
• Fit all simple regression models.
• Consider the predictor with the highest t-statistic (lowest P-value).
  – If P ≤ SLE, keep this variable and fit all two-variable models that include this predictor.
  – If P > SLE, stop and keep the previous model.
• Continue until no new predictors have P ≤ SLE.

Stepwise Regression
• Select SLS and SLE (SLE < SLS).
• Starts like Forward Selection (bottom-up process).
• New variables must have P ≤ SLE to enter.
• Re-tests all "old" variables that have already been entered; they must have P ≤ SLS to stay in the model.
• Continues until no new variables can be entered and no old variables need to be removed.

All Possible Regressions - Cp
• Fits every possible model. If there are K potential predictor variables, there are 2^K - 1 models.
• Label the mean square error for the model containing all K predictors MSE_K.
• For each model, compute SSE and Cp, where p is the number of parameters (including the intercept) in the model:
    Cp = SSE / MSE_K - (n - 2p)
• Select the model with the fewest predictors that has Cp ≈ p.

Regression Diagnostics
• Model assumptions:
  – Regression function correctly specified (e.g. linear)
  – Conditional distribution of Y is a normal distribution
  – Conditional distribution of Y has constant standard deviation
  – Observations on Y are statistically independent
• Residual plots can be used to check the assumptions:
  – Histogram (stem-and-leaf plot) should be mound-shaped (normal)
  – Plot of residuals versus each predictor should be a random cloud
    • U-shaped (or inverted U) ⇒ nonlinear relation
    • Funnel-shaped ⇒ non-constant variance
  – Plot of residuals versus time order (time series data) should be a random cloud; if a pattern appears, the observations are not independent.

Detecting Influential Observations
♦ Studentized residuals: residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers.
♦ Leverage values (Hat Diag): measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than 2(k+1)/n are considered potentially highly influential, where k is the number of predictors and n is the sample size.
♦ DFFITS: measure of how much an observation has affected its fitted value from the regression model. Values larger than 2*sqrt((k+1)/n) in absolute value are considered highly influential. Use standardized DFFITS in SPSS.
♦ DFBETAS: measure of how much an observation has affected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/sqrt(n) in absolute value are considered highly influential.
♦ Cook's D: measure of the aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 4/n are considered highly influential.
♦ COVRATIO: measure of the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 +/- 3(k+1)/n are considered highly influential.

Obtaining Influence Statistics and Studentized Residuals in SPSS
• Choose ANALYZE, REGRESSION, LINEAR, and input the dependent variable and set of independent variables from your model of interest (possibly having been chosen via an automated model selection method).
• Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics, and All Cases, then CONTINUE.
• Under PLOTS, select Y: *SRESID and X: *ZPRED. Also choose HISTOGRAM. These give a plot of studentized residuals versus standardized predicted values, and a histogram of standardized residuals (residual/sqrt(MSE)). Select CONTINUE.
• Under SAVE, select Studentized Residuals, Cook's, Leverage Values, Covariance Ratio, Standardized DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to your original data worksheet.

Variance Inflation Factors
• Variance inflation factor (VIF): measure of how highly correlated each independent variable is with the other predictors in the model. Used to identify multicollinearity.
• Values larger than 10 for a predictor imply large inflation of the standard errors of the regression coefficients due to that variable being in the model.
• Inflated standard errors lead to small t-statistics for partial regression coefficients and wider confidence intervals.

Nonlinearity: Polynomial Regression
• When the relation between Y and X is not linear, polynomial models can be fit that approximate the relationship within a particular range of X.
• General form of the model:
    E(Y) = α + β1 X + ... + βk X^k
• Second-order model (the most widely used case; allows one "bend"):
    E(Y) = α + β1 X + β2 X^2
• Must be very careful not to extrapolate beyond the observed X levels.

Generalized Linear Models
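The Backward Elimination loop described above can be sketched in Python. This is a minimal sketch, not SPSS's implementation; the `fit_pvalues` callback is a hypothetical stand-in for re-fitting the model on a given variable set and returning the partial t-test P-value for each predictor.

```python
def backward_eliminate(predictors, fit_pvalues, sls=0.20):
    """Drop the least significant predictor until all P-values are <= SLS.

    fit_pvalues(vars) is a hypothetical callback: it re-fits the model on
    `vars` and returns {var: p_value} for the partial t-tests.
    """
    current = list(predictors)
    while current:
        pvals = fit_pvalues(current)         # must re-fit each round:
        worst = max(current, key=pvals.get)  # partial coefficients change
        if pvals[worst] > sls:
            current.remove(worst)            # P > SLS: remove and re-fit
        else:
            break                            # all P <= SLS: stop
    return current
```

With SLS = 0.20 as in the notes, a predictor is dropped only when its P-value exceeds 0.20, so mildly significant variables are retained.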
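The Forward Selection procedure admits the same kind of sketch, under the same assumption: `fit_pvalues` is a hypothetical callback that re-fits the model on a candidate variable set and returns partial t-test P-values.

```python
def forward_select(candidates, fit_pvalues, sle=0.20):
    """Add the most significant remaining candidate while its P-value <= SLE.

    fit_pvalues(vars) is a hypothetical callback: it re-fits the model on
    `vars` and returns {var: p_value} for the partial t-tests.
    """
    chosen, remaining = [], list(candidates)
    while remaining:
        # P-value of each candidate when added to the current model
        pvals = {v: fit_pvalues(chosen + [v])[v] for v in remaining}
        best = min(remaining, key=pvals.get)
        if pvals[best] <= sle:               # P <= SLE: enter the model
            chosen.append(best)
            remaining.remove(best)
        else:
            break                            # no remaining candidate qualifies
    return chosen
```

Stepwise regression extends this loop by also re-testing the already-entered variables against SLS after each entry, as the notes describe.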
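The Cp formula from the All Possible Regressions section translates directly into code; the function names here are illustrative, not from any particular library.

```python
def n_possible_models(k):
    """Number of candidate models with K potential predictors:
    2^K - 1 (every non-empty subset of the predictors)."""
    return 2 ** k - 1

def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp for a candidate model: Cp = SSE_p / MSE_K - (n - 2p),
    where p counts parameters including the intercept and MSE_K is the
    mean square error of the full K-predictor model.  Pick the model
    with the fewest predictors whose Cp is close to p."""
    return sse_p / mse_full - (n - 2 * p)
```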
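The influence-statistic cutoffs listed under Detecting Influential Observations depend only on n and k, so they can be collected in one small helper; this is a sketch of the rules of thumb from the notes, not an SPSS function.

```python
import math

def influence_cutoffs(n, k):
    """Rule-of-thumb cutoffs from the notes, for a model with n
    observations and k predictors.  Values beyond a cutoff (in absolute
    value where applicable) flag a potentially influential observation."""
    return {
        "studentized_residual": 3.0,           # |r_i| > 3: outlier
        "leverage": 2 * (k + 1) / n,           # hat diagonal
        "dffits": 2 * math.sqrt((k + 1) / n),  # |DFFITS|
        "dfbetas": 2 / math.sqrt(n),           # |DFBETAS|, one per coefficient
        "cooks_d": 4 / n,                      # aggregate impact
        "covratio": (1 - 3 * (k + 1) / n,      # flag values outside
                     1 + 3 * (k + 1) / n),     # this interval
    }
```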
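The VIF definition behind the Variance Inflation Factors section is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. A sketch using NumPy's least-squares solver, assuming X holds the predictor columns without an intercept column:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the n-by-k predictor
    matrix X.  VIF_j = 1 / (1 - R_j^2); values above 10 signal that
    multicollinearity is inflating the coefficient standard errors."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # regress column j on an intercept plus the other k-1 columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        tss = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - (resid @ resid) / tss
        out.append(1.0 / (1.0 - r2))
    return out
```

For uncorrelated predictors each VIF is 1; the closer a predictor comes to being a linear combination of the others, the larger its VIF.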
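The second-order model E(Y) = α + β1 X + β2 X^2 from the Polynomial Regression section can be fit with NumPy's `polyfit`; the data here are made up purely for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x - 0.5 * x ** 2   # an exact quadratic, for illustration

coef = np.polyfit(x, y, deg=2)     # [beta2, beta1, alpha], highest power first
yhat = np.polyval(coef, x)         # fitted values within the observed X range
# Avoid np.polyval(coef, 50.0): predicting far beyond the observed X
# levels is exactly the extrapolation the notes warn against.
```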



UF STA 6127 - Regression Model Building
