9.520: Statistical Learning Theory and Applications          February 10th, 2010

Reproducing Kernel Hilbert Spaces

Lecturer: Lorenzo Rosasco          Scribe: Greg Durrett

1 Introduction

In the previous two lectures, we discussed the connections between the learning problem and statistical inference. However, unlike in traditional statistics, our primary goal in learning is to predict the future rather than to describe the data at hand. We also typically have a much smaller sample of data in a much higher-dimensional space, so we cannot blindly choose a model and assume it will be accurate. If the model is too highly parameterized, it will react too strongly to the data: we will overfit and fail to learn the underlying phenomenon (see Figure 1 for an example of this behavior). On the other hand, a model with too few parameters may not even describe the training data adequately, and will perform similarly badly.

Regularization provides one way to strike the appropriate balance when building a model. It requires a (possibly large) class of models and a method for evaluating the complexity of each model in the class. The concept of "kernels" will give us a flexible, computationally feasible way to implement this scheme.

[Figure 1: Two different functions learned from a small training set. (a) Data set for which we wish to learn a function. (b) Smooth function that will likely be a good predictor of future points. (c) Function that probably does not model the data well, but still minimizes the empirical error.]

The goal of these notes is to introduce a particularly useful family of hypothesis spaces called reproducing kernel Hilbert spaces (RKHS), each of which is associated with a particular kernel, and to derive the general solution of Tikhonov regularization in an RKHS, known as the representer theorem.

2 Regularization

The goal of regularization is to restore the well-posedness (specifically, making the result depend smoothly on the data) of the empirical risk minimization (ERM) technique by effectively restricting the hypothesis space $\mathcal{H}$. One way of doing this is to introduce a penalization term into the minimization:

$$\underbrace{\mathrm{ERR}(f)}_{\text{empirical error}} + \lambda \, \underbrace{\mathrm{pen}(f)}_{\text{penalization term}},$$

where the regularization parameter $\lambda$ controls the tradeoff between the two terms. The minimization will then favor simpler functions, which incur less of a penalty.

Tikhonov regularization can be written in this way, as

$$\frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2, \qquad (1)$$

where

• $\lambda > 0$ is a regularization parameter,
• $V(f(x), y)$ is the loss function, that is, the price we pay when we predict $f(x)$ in place of $y$, and
• $\|\cdot\|_{\mathcal{H}}$ is the norm in the function space $\mathcal{H}$.

This formulation is powerful, as it describes not a specific algorithm but rather a large class of algorithms. By choosing $V$ and $\mathcal{H}$ differently, we can derive a wide variety of commonly used techniques, including traditional linear regression and support vector machines (SVMs). One such instance is sketched below.
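As a concrete instance of (1), consider the square loss $V(f(x), y) = (f(x) - y)^2$ with a linear hypothesis space $f(x) = \langle w, x \rangle$ and $\|f\|_{\mathcal{H}}^2 = \|w\|^2$: Tikhonov regularization then reduces to ridge regression, which has a closed-form solution. The sketch below is our own minimal illustration; the function name and toy data are not from the notes.

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Minimize (1/n) * sum_i (<w, x_i> - y_i)^2 + lam * ||w||^2.

    This is Tikhonov regularization (1) with the square loss and a linear
    hypothesis space. Setting the gradient to zero gives the linear system
    (X^T X / n + lam * I) w = X^T y / n.
    """
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Toy data: a noisy linear function of 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

# Larger lam shrinks the solution toward zero, i.e. toward "simpler" f.
for lam in (0.0, 0.1, 10.0):
    print(lam, np.round(ridge_regression(X, y, lam), 2))
```

Swapping in a different loss (e.g., the hinge loss) and an RKHS norm would instead yield a support vector machine; the overall form of (1) is unchanged.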
Given our intuition about what causes overfitting, the penalization should somehow force us to choose $f$ to be as smooth as possible while still fitting the data. The norm from our function space $\mathcal{H}$ will allow us to encode this criterion, but in order to design this norm appropriately, we need to describe reproducing kernel Hilbert spaces.

3 Functional Analysis Background

In order to define an RKHS, we will make use of several terms from functional analysis, which we define here. Additional review of functional analysis can be found in the notes from the math camp, available on the website.

Definition 1 A function space $\mathcal{F}$ is a space whose elements are functions, e.g., $f : \mathbb{R}^d \to \mathbb{R}$.

Definition 2 An inner product is a function $\langle \cdot, \cdot \rangle : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ that satisfies the following properties for every $f, g, f_1, f_2 \in \mathcal{F}$ and $r_1, r_2 \in \mathbb{R}$:

1. Symmetric: $\langle f, g \rangle = \langle g, f \rangle$;
2. Linear: $\langle r_1 f_1 + r_2 f_2, g \rangle = r_1 \langle f_1, g \rangle + r_2 \langle f_2, g \rangle$;
3. Positive-definite: $\langle f, f \rangle \ge 0$ for all $f \in \mathcal{F}$, and $\langle f, f \rangle = 0$ iff $f = 0$.

Definition 3 A norm is a nonnegative function $\|\cdot\| : \mathcal{F} \to \mathbb{R}$ such that for all $f, g \in \mathcal{F}$ and $\alpha \in \mathbb{R}$:

1. $\|f\| \ge 0$, and $\|f\| = 0$ iff $f = 0$;
2. $\|f + g\| \le \|f\| + \|g\|$;
3. $\|\alpha f\| = |\alpha| \, \|f\|$.

A norm can be defined via an inner product, as $\|f\| = \sqrt{\langle f, f \rangle}$.

Note that while the dot product in $\mathbb{R}^d$ is an example of an inner product, an inner product is more general and requires only the properties given above. Similarly, while the Euclidean norm is an example of a norm, we consider a wider class of norms on the function spaces we will use.

Definition 4 A Hilbert space is a complete, (possibly) infinite-dimensional linear space endowed with an inner product.

A norm in $\mathcal{H}$ can be naturally defined from the given inner product, as $\|\cdot\| = \sqrt{\langle \cdot, \cdot \rangle}$. Although it is possible to impose a different norm so long as it satisfies the criteria given above, we will not do this in general; our norm is assumed to be the one derived from the inner product. Furthermore, we always assume that $\mathcal{H}$ is separable (contains a countable dense subset), so that $\mathcal{H}$ has a countable orthonormal basis.

While this tells us what a Hilbert space is, it is not intuitively clear why we need this mechanism, or what we gain by using it. Essentially, a Hilbert space lets us apply concepts from finite-dimensional linear algebra to infinite-dimensional spaces of functions. In particular, the fact that a Hilbert space is complete will guarantee the convergence of certain algorithms. More importantly, the presence of an inner product allows us to make use of orthogonality and projections, which will become important later.

3.1 Examples of function spaces

• One function space is the space of continuous functions on the interval $[a, b]$, denoted by $C[a, b]$. A norm can be established by defining
  $$\|f\| = \max_{a \le x \le b} |f(x)|.$$
  However, there is no inner product on this space that induces this norm, so it is not a Hilbert space.

• Another example is the space of square-integrable functions on the interval $[a, b]$, denoted by $L^2[a, b]$. We define the inner product as
  $$\langle f, g \rangle = \int_a^b f(x) g(x) \, dx,$$
  which induces the norm
  $$\|f\| = \sqrt{\int_a^b f^2(x) \, dx}.$$
  It can be checked that this space is complete, so it is a Hilbert space. However, there is one problem with the functions in this space. Consider trying to evaluate the function $f(x)$ at the point $x = k$. There exists a function $g$ in the space defined as follows:
  $$g(x) = \begin{cases} c & \text{if } x = k, \\ f(x) & \text{otherwise.} \end{cases}$$
  Because it differs from $f$ only at one point, $g$ is clearly still square-integrable, and moreover, $\|f - g\| = 0$. However, we can set the constant $c$ (or, more generally, the value of $g(x)$ at any finite number of points) arbitrarily and still have $\|f - g\| = 0$, so pointwise evaluation is not well defined in this space (see the numerical sketch below).
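As a numerical companion to the two examples above, the sketch below (our own illustration; the choice $f = \sin$ on $[0, \pi]$ and the grid sizes are made up) approximates the $C[a, b]$ and $L^2[a, b]$ norms with Riemann sums, then changes $f$ at the single point $x = \pi/2$ and shows the discretized $\|f - g\|^2$ shrinking like the grid spacing, consistent with the true value being $0$.

```python
import numpy as np

# Illustrative numerics on [a, b] = [0, pi] with f(x) = sin(x).
a, b = 0.0, np.pi

def grid(n):
    """Uniform grid of n points on [a, b], plus its spacing h."""
    x = np.linspace(a, b, n)
    return x, x[1] - x[0]

x, h = grid(10_001)
f = np.sin(x)

# C[a, b] norm: the maximum of |f| over the interval.
sup_norm = np.max(np.abs(f))
# L^2[a, b] norm: sqrt(integral of f^2), approximated by a Riemann sum.
l2_norm = np.sqrt(h * np.sum(f ** 2))
print(sup_norm, l2_norm)  # ~1.0 and ~sqrt(pi/2) ~= 1.2533

# The evaluation problem: g equals f except at the single point x = pi/2,
# where we set g = 42. The discretized ||f - g||^2 scales like h, so it
# vanishes as the grid is refined; the true integral is 0, since a single
# point has measure zero.
for n in (1_001, 100_001, 1_000_001):
    x, h = grid(n)
    f = np.sin(x)
    g = f.copy()
    g[n // 2] = 42.0          # x[n // 2] == pi / 2 for odd n
    print(n, h * np.sum((f - g) ** 2))
```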