8. Issues with Bases

Nonparametric regression often tries to fit a model of the form
$$f(x) = \sum_{j=1}^{M} \beta_j h_j(x) + \epsilon$$
where the $h_j$ functions may pick out specific components of $x$ (multiple linear regression), or be powers of components of $x$ (polynomial regression), or be prespecified transformations of components of $x$ (nonlinear regression).

Many functions $f$ cannot be represented by the kinds of $h_j$ listed above. But if the set $\{h_j\}$ is an orthogonal basis for a space of functions that contains $f$, then we can exploit many standard strategies from linear regression.

8.1 Hilbert Spaces

Usually we are concerned with spaces of functions such as $L_2[a, b]$, the Hilbert space of all real-valued functions defined on the interval $[a, b]$ that are square-integrable:
$$\int_a^b f(x)^2 \, dx < \infty.$$
This definition extends to functions on $\mathbb{R}^p$.

Hilbert spaces have an inner product. For $L_2[a, b]$ it is
$$\langle f, g \rangle = \int_a^b f(x) g(x) \, dx.$$
The inner product defines a norm $\|f\| = \langle f, f \rangle^{1/2}$, which induces a metric on the space of functions.

There are additional requirements for a Hilbert space, such as completeness (i.e., the space contains the limit of every Cauchy sequence), but we can ignore those here.

A set of functions $\{h_j\}$ in a Hilbert space is mutually orthogonal if $\langle h_j, h_k \rangle = 0$ for all $j \neq k$. Additionally, if $\|h_j\| = 1$ for all $j$, then the set is orthonormal.

If $\{h_j\}$ is an orthonormal basis for a space $H$, then every function $f \in H$ can be uniquely written as
$$f(x) = \sum_{j=1}^{\infty} \beta_j h_j(x), \qquad \text{where } \beta_j = \langle f, h_j \rangle.$$

Some famous orthogonal bases include the Fourier basis, consisting of $\{\cos nx, \sin nx\}$; wavelets; Legendre polynomials; and Hermite polynomials.

If one has an orthonormal basis for the space in which $f$ lives, then several nice things happen:

1. Since $f$ is a linear function of the basis elements, simple regression methods allow us to estimate the coefficients $\beta_j$.
2. Since the basis is orthogonal, the estimates of the different $\beta_j$ coefficients are independent.
3. Since the set is a basis, there is no identifiability problem; each function in $H$ is uniquely expressed as a weighted sum of basis elements.

But not all orthonormal bases are equally good. If one can find a basis that includes $f$ itself, that would be ideal. Second best is a basis in which only a few elements in the representation of $f$ have non-zero coefficients. But this criterion is tautological, since we do not know $f$.

One approach to choosing a basis is to use Gram-Schmidt orthogonalization to construct a basis set whose first few elements correspond to the kinds of functions that statisticians expect to encounter (see the sketch below).

This approach is often practical if the right kind of domain knowledge is available. This is often the case in audio signal processing; Fourier series are natural ways to represent vibration.

But in general, one must pick a basis set without regard to $f$. A common criterion is that the influence of the basis elements be local. From this standpoint, polynomials and trigonometric functions are bad, because their support is the whole line, but splines and wavelets are good, because their support is essentially compact.
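To make the Gram-Schmidt construction and the coefficient formula $\beta_j = \langle f, h_j \rangle$ concrete, here is a minimal sketch in Python (an illustration added here, not part of the original notes): it orthonormalizes the monomials on $[-1, 1]$, which yields rescaled Legendre polynomials, and then recovers the expansion coefficients of a target function by numerical inner products.

```python
import numpy as np

# Discretize [-1, 1]; inner products become trapezoid-rule sums.
x = np.linspace(-1.0, 1.0, 2001)
w = np.full_like(x, x[1] - x[0])
w[0] /= 2.0
w[-1] /= 2.0

def inner(f_vals, g_vals):
    """Approximate <f, g> = integral of f(x) g(x) dx over [-1, 1]."""
    return np.sum(w * f_vals * g_vals)

# Gram-Schmidt on the monomials 1, x, x^2, ...; the result is an
# orthonormal polynomial basis (rescaled Legendre polynomials).
basis = []
for k in range(6):
    h = x**k
    for b in basis:
        h = h - inner(h, b) * b              # subtract projections onto earlier h_j
    basis.append(h / np.sqrt(inner(h, h)))   # normalize so that ||h_j|| = 1

# Coefficients beta_j = <f, h_j> for a target function f.
f = np.exp(x)
beta = [inner(f, h) for h in basis]

# The truncated expansion sum_j beta_j h_j(x) reconstructs f closely.
f_hat = sum(b * h for b, h in zip(beta, basis))
print("max reconstruction error:", np.max(np.abs(f - f_hat)))
```

Orthonormality is what makes this work one coefficient at a time: each $\beta_j$ is obtained from a single inner product, without refitting the others.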
Given an orthonormal basis, one can try estimating all of the $\{\beta_j\}$. But one quickly runs out of data: even if $f$ is exactly equal to one of the basis elements, noise ensures that all of the other elements will make some contribution to the fit.

To address this problem one can:

• restrict the set of functions, so it is no longer a basis (e.g., linear regression);
• select only those basis elements that seem to contribute significantly to the fit of the model (e.g., variable selection methods, or greedy fitters like MARS, CART, and boosting);
• regularize, to restrict the values of the coefficients (e.g., through ridge regression, shrinkage, or thresholding).

The first technique is problematic, especially since it prevents flexibility in fitting nonlinear functions.

The second technique is important, especially when one has large $p$ but small $n$. But it is often insufficient, and the theory for aggressive variable selection is not yet well developed.

[Figure 7.2 of Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning: schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labelled "closest fit in population". A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.]

8.2 Ridge Regression

Ridge regression is an old idea, used originally to protect against multicollinearity. It shrinks the coefficients in a regression towards zero (and towards each other) by imposing a penalty on the sum of the squares of the coefficients:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
where $\lambda \geq 0$ is a penalty parameter that controls the amount of shrinkage.

Recall Stein's result that when estimating a multivariate normal mean (in three or more dimensions) with squared error loss, the sample average is inadmissible and can be improved by shrinking the estimator towards 0 (or any other value).

Neural net methods now often apply similar shrinkage to the weights at each node, thereby improving predictive squared-error accuracy.

[Figure 3.8 of Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning: principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects $y$ onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.]
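The component-wise shrinkage described in the caption can be checked numerically. In a minimal sketch (an illustration added here, not from the notes): writing the centered design matrix as $X = UDV^{\top}$, the ridge fitted values are $\hat{y} = \sum_j u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, \langle u_j, y \rangle$, so directions with small singular values $d_j$, the low-variance principal components, are shrunk the hardest.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=200)  # make one nearly collinear column
Xc = X - X.mean(axis=0)                          # center the design matrix

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = 50.0
shrink = d**2 / (d**2 + lam)     # ridge shrinkage factor for each component

print("singular values:  ", np.round(d, 2))
print("shrinkage factors:", np.round(shrink, 3))  # smallest d_j is shrunk most
```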
Ridge regression methods are not equivariant under scaling of the inputs, so one typically standardizes the inputs before fitting.
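Following that advice, here is a minimal sketch of the penalized criterion above (again an added illustration, with assumed conventions): the inputs are standardized before fitting, the intercept is left unpenalized by centering $y$, and the coefficients are mapped back to the original scale at the end.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum_i (y_i - b0 - x_i'b)^2 + lam * ||b||^2,
    with the penalty applied on the standardized scale."""
    X_mean, X_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - X_mean) / X_std                  # standardize each input column
    y_mean = y.mean()
    yc = y - y_mean                            # center y: leaves b0 unpenalized
    p = Xs.shape[1]
    # Closed form: beta = (Xs'Xs + lam I)^{-1} Xs' yc
    beta_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    beta = beta_s / X_std                      # undo the scaling
    beta0 = y_mean - X_mean @ beta
    return beta0, beta

# Toy check: coefficients shrink towards zero as lambda grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)
for lam in [0.0, 10.0, 1000.0]:
    beta0, beta = ridge_fit(X, y, lam)
    print(lam, np.round(beta, 3))
```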