# Issues with Bases

Nonparametric regression often tries to fit a model of the form

$$f(x) = \sum_{j=1}^{M} \beta_j h_j(x) + \epsilon,$$

where the $h_j$ functions may pick out specific components of $x$ (multiple linear regression), be powers of components of $x$ (polynomial regression), or be prespecified transformations of components of $x$ (nonlinear regression).

Many functions $f$ cannot be represented by the kinds of $h_j$ listed above. But if the set $\{h_j\}$ is an orthogonal basis for a space of functions that contains $f$, then we can exploit many standard strategies from linear regression.

## 8.1 Hilbert Spaces

Usually we are concerned with spaces of functions such as $L^2[a, b]$, the Hilbert space of all real-valued functions defined on the interval $[a, b]$ that are square-integrable:

$$\int_a^b f(x)^2 \, dx < \infty.$$

This definition extends to functions on $\mathbb{R}^p$.

Hilbert spaces have an inner product. For $L^2[a, b]$ it is

$$\langle f, g \rangle = \int_a^b f(x)\, g(x) \, dx.$$

The inner product defines a norm $\|f\| = \langle f, f \rangle^{1/2}$, which in turn gives a metric on the space of functions.

There are additional requirements for a Hilbert space, such as completeness (i.e., the space contains the limit of every Cauchy sequence), but we can ignore those here.

A set of functions $\{h_j\}$ in a Hilbert space is mutually orthogonal if $\langle h_j, h_k \rangle = 0$ for all $j \neq k$. If, additionally, $\|h_j\| = 1$ for all $j$, then the set is orthonormal.

If $\{h_j\}$ is an orthonormal basis for a space $H$, then every function $f \in H$ can be written uniquely as

$$f(x) = \sum_{j=1}^{\infty} \beta_j h_j(x), \qquad \text{where } \beta_j = \langle f, h_j \rangle.$$

Some famous orthogonal bases include the Fourier basis $\{\cos nx, \sin nx\}$, wavelets, the Legendre polynomials, and the Hermite polynomials.

If one has an orthonormal basis for the space in which $f$ lives, then several nice things happen: (1) since $f$ is a linear function of the basis elements, simple regression methods allow us to estimate the coefficients $\beta_j$; (2) since the basis is orthogonal, the estimates of the different $\beta_j$ coefficients are independent; and (3)
since the set is a basis, there is no identifiability problem: each function in $H$ is uniquely expressed as a weighted sum of basis elements.

But not all orthonormal bases are equally good. If one can find a basis that includes $f$ itself, that would be ideal. Second best is a basis in which only a few elements in the representation of $f$ have non-zero coefficients. But this criterion is tautological, since we do not know $f$.

One approach to choosing a basis is to use Gram-Schmidt orthogonalization to construct a basis set such that the first few elements correspond to the kinds of functions that statisticians expect to encounter. This approach is often practical when the right kind of domain knowledge is available, as is often the case in audio signal processing, where Fourier series are natural ways to represent vibration.

In general, though, one must pick a basis set without regard to $f$. A common criterion is that the influence of the basis elements be local. From this standpoint, polynomials and trigonometric functions are bad, because their support is the whole line, while splines and wavelets are good, because their support is essentially compact.

Given an orthonormal basis, one can try estimating all of the $\beta_j$. But one quickly runs out of data: even if $f$ is exactly equal to one of the basis elements, noise ensures that all of the other elements will make some contribution to the fit.

To address this problem one can:

- restrict the set of functions, so that it is no longer a basis (e.g., linear regression);
- select only those basis elements that seem to contribute significantly to the fit of the model (e.g., variable selection methods, or greedy fitters like MARS, CART, and boosting);
- regularize, to restrict the values of the coefficients (e.g., through ridge regression, shrinkage, or thresholding).

The first technique is problematic, especially since it prevents flexibility in fitting nonlinear functions. The second technique is important, especially when one has large $p$ but small $n$; it is often insufficient, however, and the theory for aggressive variable selection is not yet well developed.

*(Figure 7.2 of Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001: a schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled by a black dot. The model bias from the truth is shown, along with the variance, indicated by a large circle centered at the dot labeled "closest fit in population". A shrunken or regularized fit is also shown, having additional estimation bias but smaller prediction error due to its decreased variance.)*

## 8.2 Ridge Regression

Ridge regression is an old idea, used originally to protect against multicollinearity.
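As a small illustration of that protective effect (a sketch on made-up synthetic data, using only NumPy; the `ridge` helper is defined here for the example), nearly collinear inputs make the ordinary least-squares coefficients unstable, while the ridge penalty keeps them near the truth:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors: x2 is x1 plus a tiny perturbation.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)  # true coefficients are (1, 1)

def ridge(X, y, lam):
    """Penalized least squares: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
beta_ridge = ridge(X, y, 1.0)  # shrunken, stabilized estimates

# The near-singular X'X lets the OLS coefficients wander far from (1, 1)
# along the low-variance direction; the ridge estimates stay close to it.
print(beta_ols, beta_ridge)
```

The penalty matters only along the direction in which $X^TX$ is nearly singular; in the well-determined direction the fit is essentially unchanged, which is why a modest $\lambda$ stabilizes the estimates without badly biasing them.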
Ridge shrinks the coefficients in a regression towards zero (and towards each other) by imposing a penalty on the sum of the squares of the coefficients:

$$\hat{\beta} = \operatorname*{argmin}_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\},$$

where $\lambda \ge 0$ is a penalty parameter that controls the amount of shrinkage.

Recall Stein's result that, when estimating a multivariate normal mean under squared-error loss, the sample average is inadmissible and can be improved upon by shrinking the estimator towards 0 (or towards any other value).

Neural network methods now often apply similar shrinkage to the weights at each node, thereby improving predictive squared-error accuracy.

*(Figure 3.8 of Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001: principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects $y$ onto these components, and then shrinks the coefficients of the low-variance components more than those of the high-variance components.)*

Ridge regression methods are not equivariant under scaling of the inputs, so the inputs are normally standardized before the penalty is applied.
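That lack of scale equivariance is easy to check numerically. The sketch below (made-up data, NumPy only, no intercept for brevity) rescales one input: the ordinary least-squares coefficient simply rescales in response, while the ridge coefficient does not.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 60, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

def fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

c = 10.0
Xs = X.copy()
Xs[:, 0] *= c  # measure the first input in different units

# OLS (lam = 0) is equivariant: the rescaled coefficient is the old one over c.
b_ols, b_ols_s = fit(X, y, 0.0), fit(Xs, y, 0.0)
print(np.isclose(b_ols[0], c * b_ols_s[0]))  # True

# Ridge (lam > 0) is not: the penalty bears differently on the rescaled column.
b_r, b_r_s = fit(X, y, 5.0), fit(Xs, y, 5.0)
print(np.isclose(b_r[0], c * b_r_s[0]))      # False
```

Because the answer depends on the units of each column, standardizing the inputs first puts every coefficient on an equal footing before the common penalty $\lambda$ is applied.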