MIT 9.520 - Online Learning

Online Learning
9.520 Class 12, 20 March 2006
Andrea Caponnetto, Sanmay Das

About this class

Goal: To introduce the general setting of online learning. To describe an online version of the RLS algorithm and analyze its performance. To discuss convergence results of the classical Perceptron algorithm. To introduce the "experts" framework and prove mistake bounds in that framework. To show the relationship between online learning and the theory of learning in games.

What is online learning?

Sample data are arranged in a sequence. Each time we get a new input, the algorithm tries to predict the corresponding output. As the number of seen samples increases, the predictions hopefully improve.

Assets:
1. does not require storing all data samples
2. typically fast algorithms
3. it is possible to give formal guarantees without assuming probabilistic hypotheses (mistake bounds)

but...
• performance can be worse than that of the best batch algorithms
• generalization bounds always require some assumption on the generation of the sample data

Online setting

Sequence of sample data $z_1, z_2, \dots, z_n$. Each sample is an input-output pair $z_i = (x_i, y_i)$, with $x_i \in X \subset \mathbb{R}^d$ and $y_i \in Y \subset \mathbb{R}$. In the classification case $Y = \{+1, -1\}$; in the regression case $Y = [-M, M]$. The loss function is $V : \mathbb{R} \times Y \to \mathbb{R}_+$ (e.g. the 0-1 loss $E(w, y) = \Theta(-yw)$ and the hinge loss $V(w, y) = |1 - yw|_+$). The estimators $f_i : X \to Y$ are constructed using the first $i$ data samples.

Online setting (cont.)

• initialization $f_0$
• for $i = 1, 2, \dots, n$:
  • receive $x_i$
  • predict $f_{i-1}(x_i)$
  • receive $y_i$
  • update $(f_{i-1}, z_i) \to f_i$

Note: storing $f_{i-1}$ efficiently may require much less memory than storing all previous samples $z_1, z_2, \dots, z_{i-1}$.

Goals

Batch learning: reducing the expected loss
$$I[f_n] = \mathbb{E}_z\, V(f_n(x), y).$$
Online learning: reducing the cumulative loss
$$\sum_{i=1}^n V(f_{i-1}(x_i), y_i).$$

Online implementation of RLS*

Update rule: for some choice of the sequences of positive parameters $\gamma_i$ and $\lambda_i$,
$$f_i = f_{i-1} - \gamma_i \big[ (f_{i-1}(x_i) - y_i) K_{x_i} + \lambda_i f_{i-1} \big],$$
where $K : X \times X \to \mathbb{R}$ is a positive-definite kernel and, for every $x \in X$, $K_x(x') = K(x, x')$.

Note: this rule has a simple justification assuming that the sample points $(x_i, y_i)$ are i.i.d. from a probability distribution $\rho$.

*Online learning algorithms. Smale, Yao. 05

Interpretation of online RLS

For the sake of simplicity, let us set $\lambda_i = \lambda > 0$. We would like to estimate the ideal regularized least-squares estimator $f_\rho^\lambda$,
$$f_\rho^\lambda = \arg\min_{f \in \mathcal{H}_K} \int_{X \times Y} (f(x) - y)^2 \, d\rho(z) + \lambda \|f\|_K^2.$$
From the definition above it can be shown that $f_\rho^\lambda$ also satisfies
$$\mathbb{E}_{z \sim \rho}\big[(f_\rho^\lambda(x) - y) K_x + \lambda f_\rho^\lambda\big] = 0,$$
therefore $f_\rho^\lambda$ is also the equilibrium point of the averaged online update equation
$$f_i = f_{i-1} - \gamma_i\, \mathbb{E}_{z_i \sim \rho}\big[(f_{i-1}(x_i) - y_i) K_{x_i} + \lambda f_{i-1}\big].$$

Generalization bound for the online algorithm*

Theorem: Let $f_\rho$ be the minimizer of the expected squared loss $I[f]$ (i.e. the regression function). Assume $K(x, x) \le \kappa$ for some positive constant $\kappa$, and $L_K^{-r} f_\rho \in L^2(X, \rho_X)$ for some $r \in [1/2, 1]$. Then, letting $\gamma_i = c_1 i^{-\frac{2r}{2r+1}}$ and $\lambda_i = c_2 i^{-\frac{1}{2r+1}}$ for some constants $c_1$ and $c_2$, with probability greater than $1 - \delta$, for all $i \in \mathbb{N}$ it holds that
$$I[f_i] \le I[f_\rho] + C\, i^{-\frac{2r}{2r+1}},$$
where $C$ depends on $M$, $\kappa$, $r$, $\|L_K^{-r} f_\rho\|_\rho$ and $\delta$.

Note: the rates of convergence $O\big(i^{-\frac{2r}{2r+1}}\big)$ are the best theoretically attainable under these assumptions.

*Online learning as stochastic approximations of regularization paths. Tarres, Yao. 05
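To make the update rule concrete, here is a minimal sketch of online RLS under the assumption that the estimator is stored as a kernel expansion $f_i(x) = \sum_{j \le i} c_j K(x_j, x)$, with step sizes following the rates of the theorem above. The Gaussian kernel, the constants c1 and c2, and all function names are illustrative choices, not part of the original slides.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # One example of a positive-definite kernel K(x, x'); any pd kernel works here.
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(xp)) ** 2 / (2 * sigma ** 2))

def online_rls(stream, kernel=gaussian_kernel, c1=1.0, c2=1.0, r=0.5):
    # Stores f_i as the kernel expansion f_i(x) = sum_j coef[j] * K(points[j], x)
    # and applies the update
    #   f_i = (1 - gamma_i * lambda_i) f_{i-1} - gamma_i * (f_{i-1}(x_i) - y_i) * K_{x_i},
    # with gamma_i = c1 * i^(-2r/(2r+1)) and lambda_i = c2 * i^(-1/(2r+1)).
    points, coef = [], []

    def f(x):
        return sum(c * kernel(xj, x) for c, xj in zip(coef, points))

    for i, (x_i, y_i) in enumerate(stream, start=1):
        gamma_i = c1 * i ** (-2 * r / (2 * r + 1))
        lambda_i = c2 * i ** (-1 / (2 * r + 1))
        residual = f(x_i) - y_i                               # prediction error on the new point
        coef = [(1 - gamma_i * lambda_i) * c for c in coef]   # shrink all existing terms
        coef.append(-gamma_i * residual)                      # coefficient of the new K_{x_i} term
        points.append(x_i)

    return points, coef, f
```

With this representation the update simply multiplies the existing coefficients by $(1 - \gamma_i \lambda_i)$ and appends the new coefficient $-\gamma_i (f_{i-1}(x_i) - y_i)$ for the term $K_{x_i}$, so no previous samples need to be revisited.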
The Perceptron Algorithm

We consider the classification problem: $Y = \{-1, +1\}$. We deal with linear estimators $f_i(x) = \omega_i \cdot x$, with $\omega_i \in \mathbb{R}^d$. The 0-1 loss $E(f_i(x), y) = \Theta(-y(\omega_i \cdot x))$ is the natural choice in the classification context. We will also consider the more tractable hinge loss
$$V(f_i(x), y) = |1 - y(\omega_i \cdot x)|_+.$$
Update rule: if $E_i = E(f_{i-1}(x_i), y_i) = 0$ then $\omega_i = \omega_{i-1}$, otherwise
$$\omega_i = \omega_{i-1} + y_i x_i$$
(a short code sketch of this update appears at the end).

The Perceptron Algorithm (cont.)

Passive-aggressive strategy of the update rule: if $f_{i-1}$ classifies $x_i$ correctly, don't move; if $f_{i-1}$ classifies it incorrectly, try to increase the margin $y_i(\omega \cdot x_i)$. In fact,
$$y_i(\omega_i \cdot x_i) = y_i(\omega_{i-1} \cdot x_i) + y_i^2 \|x_i\|^2 > y_i(\omega_{i-1} \cdot x_i).$$

Perceptron Convergence Theorem*

Theorem: If the samples $z_1, \dots, z_n$ are linearly separable, then, presenting them cyclically to the Perceptron algorithm, the sequence of weight vectors $\omega_i$ eventually converges.

We will prove a more general result encompassing both the separable and the inseparable cases.

*Pattern Classification. Duda, Hart, Stork, 01

Mistake Bound*

Theorem: Assume $\|x_i\| \le R$ for every $i = 1, 2, \dots, n$. Then for every $u \in \mathbb{R}^d$
$$M \le \Bigg( R\|u\| + \sqrt{\sum_{i=1}^n \hat{V}_i^2} \Bigg)^2,$$
where $\hat{V}_i = V(u \cdot x_i, y_i)$ and $M$ is the total number of mistakes: $M = \sum_{i=1}^n E_i = \sum_{i=1}^n E(f_{i-1}(x_i), y_i)$.

*Online Passive-Aggressive Algorithms. Crammer et al, 03

Mistake Bound (cont.)

• The boundedness condition $\|x_i\| \le R$ is necessary.
• In the separable case there exists $u^*$ inducing margins $y_i(u^* \cdot x_i) \ge 1$, and therefore null "batch" loss over the sample points. The mistake bound becomes $M \le R^2 \|u^*\|^2$.
• In the inseparable case we can let $u$ be the best possible linear separator. The bound compares the online performance with the best batch performance over a given class of competitors.

Proof

The terms $\omega_i \cdot u$ increase as $i$ increases:
1. If $E_i = 0$ then $\omega_i \cdot u = \omega_{i-1} \cdot u$.
2. If $E_i = 1$, since $\hat{V}_i = |1 - y_i(x_i \cdot u)|_+$,
   $$\omega_i \cdot u = \omega_{i-1} \cdot u + y_i(x_i \cdot u) \ge \omega_{i-1} \cdot u + 1 - \hat{V}_i.$$
3. Hence, in both cases, $\omega_i \cdot u \ge \omega_{i-1} \cdot u + (1 - \hat{V}_i) E_i$.
4. Summing up, $\omega_n \cdot u \ge M - \sum_{i=1}^n \hat{V}_i E_i$.

Proof (cont.)

The terms $\|\omega_i\|$ do not increase too quickly:
1. If $E_i = 0$ then $\|\omega_i\|^2 = \|\omega_{i-1}\|^2$.
2. If $E_i = 1$, since $y_i(\omega_{i-1} \cdot x_i) \le 0$,
   $$\|\omega_i\|^2 = (\omega_{i-1} + y_i x_i) \cdot (\omega_{i-1} + y_i x_i) = \|\omega_{i-1}\|^2 + \|x_i\|^2 + 2 y_i(\omega_{i-1} \cdot x_i) \le \|\omega_{i-1}\|^2 + R^2.$$
3. Summing up, $\|\omega_n\|^2 \le M R^2$.

Proof (cont.)

Using the estimates for $\omega_n \cdot u$ and $\|\omega_n\|^2$, and applying the Cauchy-Schwarz inequality:
1. By Cauchy-Schwarz, $\omega_n \cdot u \le \|\omega_n\| \|u\|$, hence
   $$M - \sum_{i=1}^n \hat{V}_i E_i \le \omega_n \cdot u \le \|\omega_n\| \|u\| \le \sqrt{M}\, R \|u\|.$$
2. Finally, by Cauchy-Schwarz again, $\sum_{i=1}^n \hat{V}_i E_i \le \sqrt{\sum_{i=1}^n \hat{V}_i^2}\, \sqrt{\sum_{i=1}^n E_i^2}$, and since $\sum_{i=1}^n E_i^2 = M$, dividing by $\sqrt{M}$ gives
   $$\sqrt{M} - \sqrt{\sum_{i=1}^n \hat{V}_i^2} \le R \|u\|.$$
   Rearranging and squaring yields the claimed bound.

The Experts Framework

We will focus on the classification case. Suppose we have a pool of prediction strategies, called experts; denote the pool by $E = \{E_1, \dots, E_n\}$. Each expert predicts $y_i$ based on $x_i$. We want to combine these experts to produce a single master algorithm for classification, and to prove bounds on how much worse it is than the best expert.

The Halving Algorithm

Suppose all the experts are functions (their predictions for a point in the space do not change over time) and at least one of them is consistent with the data. At each step, predict what the majority of the experts that have been consistent with all previous examples predict.
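A minimal sketch of the halving strategy just described, under the assumption that each expert is simply a function mapping an input to a label in $\{-1, +1\}$; the function and variable names are illustrative, not from the original slides.

```python
def halving(experts, stream):
    # Keep only the experts that have been consistent with every example seen
    # so far, and predict with the majority vote of the surviving experts.
    alive = list(experts)              # experts still consistent with the data
    mistakes = 0
    for x, y in stream:
        votes = sum(e(x) for e in alive)
        prediction = +1 if votes >= 0 else -1    # majority vote (ties broken as +1)
        if prediction != y:
            mistakes += 1
        alive = [e for e in alive if e(x) == y]  # drop every expert that erred on (x, y)
    return mistakes
```

Whenever the master makes a mistake, the majority of the surviving experts were wrong, so at least half of them are eliminated; this is what yields the usual $\log_2 |E|$ mistake bound for the halving algorithm.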


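Finally, going back to the Perceptron section above, here is a minimal sketch of the update rule $\omega_i = \omega_{i-1} + y_i x_i$ together with the mistake count $M$. The NumPy-based implementation and the zero-margin convention (counting $y(\omega \cdot x) = 0$ as a mistake) are assumptions, not from the original slides.

```python
import numpy as np

def perceptron(stream):
    # Online Perceptron: start from omega_0 = 0; on each mistake (non-positive
    # margin y * (omega . x)), update omega <- omega + y * x; otherwise do nothing.
    omega = None
    mistakes = 0                       # M = total number of mistakes
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if omega is None:
            omega = np.zeros_like(x)   # omega_0 = 0
        if y * np.dot(omega, x) <= 0:  # E_i = 1: x_i is misclassified
            omega = omega + y * x      # the Perceptron update
            mistakes += 1
    return omega, mistakes
```

On a linearly separable stream with margins $y_i(u^* \cdot x_i) \ge 1$, the mistake count returned here is bounded by $R^2 \|u^*\|^2$, matching the separable case of the mistake bound above.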