MIT 9.520 - Approximation Theory

Contents
Approximation Theory
References
Outline
Notation
Example Hypothesis Spaces
Calculating approximation rates
Target Space
Hypothesis Space
Approximation Rate
Curse of dimensionality
Hard Limits
N-widths
Multivariate Example
"Dimension Free" convergence
Maurey-Barron-Jones Lemma
Maurey-type Approximation Schemes
Hidden Smoothness
Algorithmic difficulty
Random Features
Random Features: Example
Generalization Error
Kernels
Random Features for Classification
Gaussian RKHS vs Random Features

Approximation Theory
Ben Recht
Center for the Mathematics of Information, Caltech
April 7, 2008

References
• The majority of this material is adapted from F. Girosi's 9.520 lecture from 2003.
  – Available on OCW
  – Very readable, with an extensive bibliography
• Random Features
  – Ali Rahimi and Benjamin Recht. "Random Features for Large-Scale Kernel Machines." NIPS 2007.
  – Ali Rahimi and Benjamin Recht. "On the power of randomized shallow belief networks." In preparation, 2008.

Outline
• Decomposition of the generalization error
• Approximation and rates of convergence
• "Dimension independent" convergence rates
• Maurey-Barron-Jones approximations
• Random Features

Notation
• how well we can do: the best predictor over all functions
• how well we can do in H: the best predictor in the hypothesis space H
• how well we can do in H with our L observations: the predictor we actually compute from the data

Generalization Error
• For the least squares cost, the generalization error decomposes as estimation error + approximation error.
• Estimation error: independent of the target space (statistics).
• Approximation error: independent of the examples (analysis).
• Judiciously select H to balance the tradeoff.

• Nested hypothesis spaces: H_1 ⊂ H_2 ⊂ … ⊂ H_n ⊂ …
• For most families of hypothesis spaces we encounter, the approximation error goes to zero as n grows.
• How fast does this error go to zero? We are interested in bounds of the form: approximation error ≤ C·n^{-r} for some rate r > 0.

Example Hypothesis Spaces
• Polynomials on [0,1]. H_n is the set of all polynomials of degree at most n. We can approximate any smooth function with a polynomial (Taylor series).
• Sines and cosines on [-π, π]. We can approximate any square integrable function with a Fourier series.

Calculating approximation rates
• Functions in this class can be represented by their Fourier series, f(x) = Σ_k c_k e^{ikx}.
• Parseval: ‖f‖²_{L2} = Σ_k |c_k|².

Target Space
• Sobolev space of smooth functions: f with s derivatives in L2.
• Using Parseval: Σ_k |k|^{2s} |c_k|² < ∞.

Hypothesis Space
• H_n is the set of trig functions of degree n.
• If f is of the form f(x) = Σ_k c_k e^{ikx}, the best approximation to f in L2 norm by H_n is given by truncating the series at degree n.

Approximation Rate
• Note that H_n has n parameters. How fast does the error of the best approximation go to zero? By Parseval, ‖f − f_n‖² = Σ_{|k|>n} |c_k|² ≤ n^{-2s} Σ_{|k|>n} |k|^{2s} |c_k|² ≤ C·n^{-2s}.
• More smoothness, faster convergence (the numerical sketch after this slide checks this rate).
• What happens in higher dimension?
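The n^{-2s} behavior is easy to check numerically by truncating an FFT. The following MATLAB sketch is illustrative and not part of the original slides: the test function abs(sin(x)).^3, the grid size, and the truncation levels are arbitrary choices, and the FFT is used only as a convenient way to compute Fourier coefficients.

% Illustrative sketch (not from the slides): empirically check the decay
% of the squared L2 error when a smooth periodic function on [-pi,pi]
% is approximated by its truncated Fourier series.
N  = 2^12;                          % number of grid points
x  = linspace(-pi, pi, N+1); x(end) = [];
f  = abs(sin(x)).^3;                % a periodic test function of limited smoothness
c  = fft(f)/N;                      % Fourier coefficients c_k
k  = [0:N/2-1, -N/2:-1];            % frequencies in FFT ordering

ns  = [4 8 16 32 64 128];
err = zeros(size(ns));
for j = 1:length(ns)
    n  = ns(j);
    ch = c;
    ch(abs(k) > n) = 0;             % keep only |k| <= n: the best L2 approximation
    fn = real(ifft(ch)*N);
    err(j) = mean((f - fn).^2);     % squared L2 error, up to a constant
end

% On a log-log plot the error decays like a power of n; the exponent
% reflects the smoothness of the chosen test function.
loglog(ns, err, 'o-');
xlabel('n'); ylabel('squared L2 error');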
• Functions can be written as multivariate Fourier series, f(x) = Σ_{k∈Z^d} c_k e^{i k·x}.
• Target space: again by Parseval, having s derivatives in L2 means Σ_k ‖k‖^{2s} |c_k|² < ∞.
• Hypothesis space H_t: trig functions of degree at most t in each coordinate.
• The number of parameters in H_t is n = t^d. The best approximation to f is again given by truncating the series.
• How fast does the error go to zero? Doing the calculation for d = 2 (and likewise in general), the error of the best approximation scales as t^{-2s}.
• As a function of the number of parameters n, the approximation error therefore scales as n^{-2s/d}.

Curse of dimensionality
• In the rate n^{-2s/d}, the smoothness s is a blessing and the dimension d is a curse.
• This provides an estimate for the number of parameters needed to reach a given accuracy.
• Is this upper bound very loose?

Hard Limits
• Tommi Poggio: just remember Nyquist…
  Sample rate = 2 × max freq
  Num samples = 2 × T × max freq
  In dimension d: Num samples = (2 × T × max freq)^d

N-widths
• Let X be a normed space of functions and let A be a subset of X. We want to approximate A with linear combinations of a finite set of "basis functions" in X.
• Kolmogorov N-widths let us quantify how well we could do over all choices of finite sets of basis functions.
• The n-width of A in X is d_n(A; X) = inf_{X_n} sup_{f∈A} inf_{g∈X_n} ‖f − g‖_X, where the outer infimum ranges over all n-dimensional subspaces X_n of X.

Multivariate Example
• Theorem (from Pinkus 1980): for the class of functions that are s times differentiable with sth derivative in L2, the n-width decays like n^{-s/d}.
• This rate is achieved by splines.

"Dimension Free" convergence
• Consider networks of the form f(x) = Σ_{k=1}^n α_k φ(x; θ_k).
• "Shallow" networks with parametric basis functions.
• Characterize when we can get good approximations.

Maurey-Barron-Jones Lemma
• Theorem: If f is in the convex hull of a set G in a Hilbert space with ‖g‖₂ ≤ b for all g ∈ G, then for every n ≥ 1 and every c′ > b² − ‖f‖₂², there is an f_n in the convex hull of n points in G such that ‖f − f_n‖₂² ≤ c′/n.
• Also known as Maurey's "empirical method."
• Many uses in computing covering numbers (see, e.g., generalization bounds, random matrices, compressive sampling, etc.).

Maurey-type Approximation Schemes
• Jones (1992)
• Barron (1993)
• Girosi & Anzellotti (1995)
• Using nearly identical analysis, all of these schemes achieve an O(1/n) rate in squared L2 norm for n-term expansions over a suitably defined class of target functions.

Hidden Smoothness
• Barron hides the smoothness via the functional ∫ ‖ω‖ |f̂(ω)| dω on the Fourier transform of f.
• Girosi and Anzellotti show that this amounts to a smoothness requirement whose order grows with the dimension.
• Note: the functions get smoother as d increases.

Algorithmic difficulty
• Training these networks is hard: the objective is nonconvex in the parameters θ_k.
• But for fixed θ_k, fitting the weights α_k is a linear least-squares problem and almost always trivial.
• How to avoid optimizing the θ_k?

Random Features
• What happens if we pick the θ_k at random and then optimize the weights?
• It turns out that, with some a priori information about the frequency content of f, we can do just as well as the classical approximation results of Maurey and co.
• Fix parameterized basis functions φ(x; θ).
• Fix a probability distribution p(θ) over the parameters.
• Our target space F_p will be the functions of the form f(x) = ∫ α(θ) φ(x; θ) dθ,
• with the convention that the weight function satisfies |α(θ)| ≤ C·p(θ) for some constant C.

Random Features: Example
• Fourier basis functions (complex exponentials, or cosines with a random phase).
• Gaussian parameters: ω drawn from a Gaussian distribution p.
• If f ∈ F_p for this choice of p, then this means that the frequency distribution of f has subgaussian tails.
• Theorem: Let f be in F_p, and let ω₁, …, ω_n be sampled i.i.d. from p. Then there are weights for which the n-term expansion Σ_k α_k φ(x; ω_k) approximates f in L2 with error O(1/√n), with probability at least 1 − δ (the constant depends on f and on log(1/δ)).

Generalization Error
• The generalization error again decomposes into estimation error + approximation error.
• It's a finite sized basis set! Standard finite-dimensional estimation bounds apply.
• Choosing the number of random features to balance the two terms gives an overall convergence rate for the generalization error.

Kernels
• Note that under the feature mapping x ↦ z(x) = (φ(x; ω₁), …, φ(x; ω_n)), the inner product z(x)ᵀz(y) is, after rescaling, a Monte Carlo estimate of the kernel k(x, y) = E_{ω∼p}[φ(x; ω) φ(y; ω)].
• Ridge regression with random features approximates Tikhonov regularized least-squares on an RKHS.

Random Features for Classification

Gaussian RKHS vs Random Features
• Random features are good when L is sufficiently large and the function is sufficiently smooth.
• Tikhonov regularization on the RKHS is good when L is small or the function is not so smooth.

% Approximates Gaussian Process regression
% with Gaussian kernel of variance gamma
% lambda:  regularization parameter
% dataset: X is d x N, y is 1 x N
% test:    xtest is d x 1
% D:       dimensionality of the random features

% training
w = randn(D, size(X,1));
b = 2*pi*rand(D,1);
Z = cos(sqrt(gamma)*w*X + repmat(b,1,size(X,2)));

% Equivalent to
% alpha = (lambda*eye(D) + Z*Z') \ (Z*y(:));
alpha = symmlq(@(v) lambda*v(:) + Z*(Z'*v(:)), ...
               Z*y(:), 1e-6, 2000);

% testing
ztest = alpha(:)'*cos(sqrt(gamma)*w*xtest(:) + b);
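As a sanity check, here is one way the snippet above might be driven end to end. This is an illustrative sketch, not part of the slides: the toy data and the values of d, N, D, gamma, and lambda are arbitrary choices, and a direct backslash solve stands in for the symmlq call (it agrees with it up to solver tolerance when D is modest).

% Illustrative driver for the random-feature regression snippet above.
% All values here are arbitrary choices made for this example.
d = 5; N = 2000;
X = randn(d, N);                        % training inputs, d x N
y = sin(sum(X)) + 0.1*randn(1, N);      % noisy scalar targets, 1 x N
xtest = randn(d, 1);                    % one test input, d x 1

D = 500;                                % number of random features
gamma  = 1;                             % kernel bandwidth parameter
lambda = 1e-2;                          % regularization parameter

% training and testing, following the snippet above
w = randn(D, d);
b = 2*pi*rand(D,1);
Z = cos(sqrt(gamma)*w*X + repmat(b,1,N));
alpha = (lambda*eye(D) + Z*Z') \ (Z*y(:));   % direct solve in place of symmlq
ztest = alpha(:)'*cos(sqrt(gamma)*w*xtest(:) + b);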


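The claim on the Kernels slide, that random-feature inner products approximate a kernel, can also be checked directly. The sketch below is illustrative and not from the slides: the dimensions, the bandwidth gamma, the test points, and the 2/D rescaling convention for cosine features are assumptions of this example.

% Illustrative check (not from the slides): the random cosine features used
% above have inner products that approximate a Gaussian kernel.
d = 3; D = 20000; gamma = 0.7;          % arbitrary choices
x = randn(d,1); y = randn(d,1);         % two arbitrary test points

w = randn(D, d);
b = 2*pi*rand(D,1);
zx = cos(sqrt(gamma)*w*x + b);
zy = cos(sqrt(gamma)*w*y + b);

approx = (2/D) * (zx'*zy);              % Monte Carlo estimate from random features
exact  = exp(-gamma*norm(x-y)^2/2);     % Gaussian kernel value
fprintf('random-feature estimate: %.4f   kernel value: %.4f\n', approx, exact);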