ON SELECTING REGRESSORS TO MAXIMIZE THEIR SIGNIFICANCE

Daniel McFadden¹
Department of Economics, University of California, Berkeley CA 94720-3880
e-mail: [email protected]
8, 1997 (Revised July 28, 1998)

¹ This research was supported by the E. Morris Cox Endowment. I thank David Freedman, Tom Rothenberg, and James Powell for useful discussion, and two anonymous referees for helpful comments.

ABSTRACT: A common problem in applied regression analysis is to select the variables that enter a linear regression. Examples are selection among capital stock series constructed with different depreciation assumptions, or use of variables that depend on unknown parameters, such as Box-Cox transformations, linear splines with parametric knots, and exponential functions with parametric decay rates. It is often computationally convenient to estimate such models by least squares, with variables selected from possible candidates by enumeration, grid search, or Gauss-Newton iteration to maximize their conventional least squares significance level; term this method Prescreened Least Squares (PLS). This note shows that PLS is equivalent to direct estimation by nonlinear least squares, and thus statistically consistent under mild regularity conditions. However, standard errors and test statistics provided by least squares are biased. When explanatory variables are smooth in the parameters that index the selection alternatives, Gauss-Newton auxiliary regression is a convenient procedure for obtaining consistent covariance matrix estimates. In cases where smoothness is absent or the true index parameter is isolated, covariance matrix estimates obtained by kernel-smoothing or bootstrap methods appear from examples to be reasonably accurate for samples of moderate size.

KEYWORDS: Variable Selection, Bootstrapping, Structural Breaks
1. Introduction

Often in applied linear regression analysis one must select an explanatory variable from a set of candidates. For example, in estimating production functions one must select among alternative measures of capital stock constructed using different depreciation assumptions. Or, in hedonic analysis of housing prices, one may use indicator or ramp variables that measure distance from spatial features such as parks or industrial plants, with cutoffs at distances that are determined as parameters. In the second example, the problem can be cast as one of nonlinear regression. However, when there are many linear parameters in the regression, direct nonlinear regression can be computationally inefficient, with convergence problematic. It is often more practical to approach this as a linear regression problem with variable selection.

This paper shows that selecting variables in a linear regression to maximize their conventional least squares significance level is equivalent to direct application of nonlinear least squares. Thus, this method provides a practical computational shortcut that shares the statistical properties of the nonlinear least squares solution. However, standard errors and test statistics produced by least squares are biased by variable selection, and are often inconsistent. I give practical consistent estimators for covariances and test statistics, and show in examples that kernel-smoothing or bootstrap methods appear to give adequate approximations in samples of moderate size.

Stated formally, the problem is to estimate the parameters of the linear model

(1)  y = Xα + Z(θ)β + u,  θ ∈ Θ,

where y is n×1, X is an n×p array of observations on fixed explanatory variables, Z = Z(θ) is an n×q array of observations on selected explanatory variables, where θ indexes candidates from a set of alternatives Θ, and u is an n×1 vector of disturbances with a scalar covariance matrix. Let k = p + q, and assume Θ ⊂ ℝ^h.
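A minimal simulation sketch of model (1) may help fix notation. The transformation, parameter values, and variable names below are hypothetical choices for illustration (a ramp-type Z(θ), as in the hedonic-pricing example), not values from the paper:

```python
import numpy as np

# Hypothetical simulation of model (1): y = X·alpha + Z(theta)·beta + u.
# All parameter values and the choice of transformation are illustrative only.
rng = np.random.default_rng(0)
n, p, q = 200, 2, 1                       # sample size and variable counts (k = p + q)
w = rng.uniform(0.0, 10.0, n)             # raw variable to transform, e.g. a distance
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed regressors (n x p)

theta_true = 4.0                          # true index parameter theta_o (the knot)
Z = np.maximum(theta_true - w, 0.0).reshape(n, q)       # ramp variable Z(theta_o)

alpha_true = np.array([1.0, 0.5])
beta_true = np.array([2.0])
u = rng.normal(0.0, 1.0, n)               # disturbances: mean zero, scalar covariance
y = X @ alpha_true + Z @ beta_true + u    # the n x 1 response of model (1)
```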
The set Θ is finite in the traditional problem of variable selection, but will be a continuum for parametric data transformations. Assume the data in (1) are generated by independent random sampling from a model y = x′αₒ + z(θₒ,w)′βₒ + u, where (y,x,w) ∈ ℝ×ℝ^p×ℝ^m is an observed data vector, z: Θ×ℝ^m → ℝ^q is a "well-behaved" parametric transformation of the data w, (αₒ,βₒ,θₒ) denote the true parameter values, and the distribution of u is independent of x and w, and has mean zero and variance σₒ². There may be overlapping variables in w and x. Examples of parametric data transformations are (a) a Box-Cox transformation z(θ,w) = (w^θ − 1)/θ for θ ≠ 0 and z(0,w) = log(w); (b) a ramp (or linear spline) function z(θ,w) = Max(θ − w, 0) with a knot at θ; (c) a structural break z(θ,w) = 1(w < θ) with a break at θ; and (d) an exponential decay z(θ,w) = e^(−θw).

One criterion for variable selection is to pick a c ∈ Θ that maximizes the conventional least squares test statistic for the significance of the resulting Z(c) variables, using enumeration, a grid search, or Gauss-Newton iteration, and then pick the least-squares estimates (a,b,s²) of (αₒ,βₒ,σₒ²) in (1) using the selected Z(c). As a shorthand, term this the Prescreened Least Squares (PLS) criterion for estimating (1). I will show that PLS is equivalent to selecting Z(θ) to maximize R², and is also equivalent to estimating (α,β,θ,σ²) in (1) jointly by nonlinear least squares. Hence, PLS shares the large-sample statistical properties of nonlinear least squares. However, standard errors and test statistics for a and b that are provided by least squares at the selected Z(c) fail to account for the impact of variable selection, and will usually be biased downward. When Θ is a continuum, a Gauss-Newton auxiliary regression associated with the nonlinear least squares formulation of the problem can be used in many cases to obtain consistent estimates of standard errors and test statistics.
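The PLS criterion can be sketched as a grid search over Θ for the ramp example (b). With X held fixed, the squared t-statistic of the single Z coefficient, R², and SSR are each monotone transforms of one another, so all three criteria select the same c. The data-generating values below are hypothetical, chosen only to make the illustration concrete:

```python
import numpy as np

# Illustrative PLS grid search for ramp example (b); all values are hypothetical.
rng = np.random.default_rng(1)
n = 300
w = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + 2.0 * np.maximum(4.0 - w, 0.0) + rng.normal(0.0, 1.0, n)

def ols_stats(theta):
    """Regress y on [X, z(theta)]; return SSR, R^2, and squared t-stat of z."""
    W = np.column_stack([X, np.maximum(theta - w, 0.0)])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    e = y - W @ coef
    ssr = float(e @ e)
    s2 = ssr / (n - W.shape[1])                 # conventional residual variance
    cov = s2 * np.linalg.inv(W.T @ W)           # conventional OLS covariance matrix
    t2 = float(coef[-1] ** 2 / cov[-1, -1])     # squared t-stat of the z coefficient
    r2 = 1.0 - ssr / float((y - y.mean()) @ (y - y.mean()))
    return ssr, r2, t2

grid = np.linspace(1.0, 9.0, 161)               # candidate knots (the set Theta)
stats = np.array([ols_stats(th) for th in grid])
c_t2 = grid[np.argmax(stats[:, 2])]             # PLS: maximize significance
c_r2 = grid[np.argmax(stats[:, 1])]             # maximize R^2
c_ssr = grid[np.argmin(stats[:, 0])]            # minimize SSR
```

The equivalence holds because the restricted SSR from regressing y on X alone does not vary with θ, so the partial F (here, the squared t) for Z(θ) is a decreasing function of SSR(θ). Note that the covariance matrix computed inside `ols_stats` is exactly the conventional one that, per the text, understates the sampling variability of the PLS estimates.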
When Θ is finite, the effects of variable selection will be asymptotically negligible, but least squares estimates of standard errors will be biased downward in finite samples.

Let M = I − X(X′X)⁻¹X′, and rewrite the model (1) in the form

(2)  y = X[α + (X′X)⁻¹X′Z(θ)β] + MZ(θ)β + u.

The explanatory variables X and MZ(θ) in (2) are orthogonal by construction, so that the sum of squared residuals satisfies

(3)  SSR(θ) = y′My − y′MZ(θ)[Z(θ)′MZ(θ)]⁻¹Z(θ)′My.

Then, the estimate c that minimizes SSR(θ) for θ ∈ Θ also maximizes the expression

(4)  S(θ) ≡ n⁻¹ y′MZ(θ)[Z(θ)′MZ(θ)]⁻¹Z(θ)′My.

The nonlinear least squares estimators for α, β, and σ² can be obtained by applying least