Online Learning
9.520 Class 12, 20 March 2006
Andrea Caponnetto, Sanmay Das

About this class
Goal: To introduce the general setting of online learning. To describe an online version of the RLS algorithm and analyze its performance. To discuss convergence results of the classical Perceptron algorithm. To introduce the "experts" framework and prove mistake bounds in that framework. To show the relationship between online learning and the theory of learning in games.

What is online learning?
Sample data are arranged in a sequence. Each time we get a new input, the algorithm tries to predict the corresponding output. As the number of seen samples increases, hopefully the predictions improve.

Assets:
1. does not require storing all data samples
2. typically fast algorithms
3. it is possible to give formal guarantees without assuming probabilistic hypotheses (mistake bounds)

but...
• performance can be worse than that of the best batch algorithms
• generalization bounds always require some assumption on the generation of the sample data

Online setting
Sequence of sample data z_1, z_2, ..., z_n. Each sample is an input-output pair z_i = (x_i, y_i), with x_i ∈ X ⊂ ℝ^d and y_i ∈ Y ⊂ ℝ. In the classification case Y = {+1, −1}; in the regression case Y = [−M, M].

Loss function V : ℝ × Y → ℝ_+ (e.g. the 0-1 loss E(w, y) = Θ(−yw) and the hinge loss V(w, y) = |1 − yw|_+).

Estimators f_i : X → Y are constructed using the first i data samples.

Online setting (cont.)
• initialization: f_0
• for i = 1, 2, ..., n:
  – receive x_i
  – predict f_{i−1}(x_i)
  – receive y_i
  – update (f_{i−1}, z_i) → f_i

Note: storing f_{i−1} efficiently may require much less memory than storing all previous samples z_1, z_2, ..., z_{i−1}.

Goals
Batch learning: reduce the expected loss
    I[f_n] = E_z V(f_n(x), y).
Online learning: reduce the cumulative loss
    Σ_{i=1}^n V(f_{i−1}(x_i), y_i).

Online implementation of RLS*
Update rule: for some choice of the sequences of positive parameters γ_i and λ_i,
    f_i = f_{i−1} − γ_i [ (f_{i−1}(x_i) − y_i) K_{x_i} + λ_i f_{i−1} ],
where K : X × X → ℝ is a positive-definite kernel and, for every x ∈ X, K_x(x′) = K(x, x′).

Note: this rule has a simple justification assuming that the sample points (x_i, y_i) are i.i.d. from a probability distribution ρ.

* Online learning algorithms. Smale and Yao, 2005.

Interpretation of online RLS
For the sake of simplicity, let us set λ_i = λ > 0. We would like to estimate the ideal regularized least-squares estimator f_ρ^λ:
    f_ρ^λ = argmin_{f ∈ H_K} ∫_{X×Y} (f(x) − y)² dρ(z) + λ‖f‖_K².
From the definition above it can be shown that f_ρ^λ also satisfies
    E_{z∼ρ}[ (f_ρ^λ(x) − y) K_x + λ f_ρ^λ ] = 0;
therefore f_ρ^λ is also the equilibrium point of the averaged online update equation
    f_i = f_{i−1} − γ_i E_{z_i∼ρ}[ (f_{i−1}(x_i) − y_i) K_{x_i} + λ f_{i−1} ].

Generalization bound for the online algorithm*
Theorem: Let f_ρ be the minimizer of the expected squared loss I[f] (i.e. the regression function). Assume K(x, x) ≤ κ for some positive constant κ, and L_K^{−r} f_ρ ∈ L²(X, ρ_X) for some r ∈ [1/2, 1]. Then, letting γ_i = c_1 i^{−2r/(2r+1)} and λ_i = c_2 i^{−1/(2r+1)} for some constants c_1 and c_2, with probability greater than 1 − δ, for all i ∈ ℕ it holds that
    I[f_i] ≤ I[f_ρ] + C i^{−2r/(2r+1)},
where C depends on M, κ, r, ‖L_K^{−r} f_ρ‖_ρ and δ.

Note: the rates of convergence O(i^{−2r/(2r+1)}) are the best theoretically attainable under these assumptions.

* Online learning as stochastic approximations of regularization paths. Tarres and Yao, 2005.

The Perceptron Algorithm
We consider the classification problem: Y = {−1, +1}. We deal with linear estimators f_i(x) = ω_i · x, with ω_i ∈ ℝ^d. The 0-1 loss E(f_i(x), y) = Θ(−y(ω_i · x)) is the natural choice in the classification context.
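As a concrete illustration, the online protocol instantiated with these linear estimators and the mistake-driven Perceptron update can be sketched as below. This is a minimal sketch; the zero-margin convention (counting y(ω · x) = 0 as a mistake, which makes the all-zeros initialization err on the first point) is an assumption, consistent with Θ(0) = 1.

```python
import numpy as np

def perceptron(samples, d):
    """Online Perceptron sketch: predict with f_{i-1}(x) = w . x, and on a
    mistake (0-1 loss E_i = 1) update w <- w + y_i * x_i; otherwise keep w.
    Returns the final weight vector and the total number of mistakes M."""
    w = np.zeros(d)                      # initialization: f_0 = 0
    mistakes = 0
    for x, y in samples:                 # receive x_i, predict, receive y_i
        x = np.asarray(x, dtype=float)
        if y * np.dot(w, x) <= 0:        # mistake under the 0-1 loss
            mistakes += 1
            w = w + y * x                # aggressive step: Perceptron update
        # otherwise: passive step, w unchanged
    return w, mistakes
```

On linearly separable data, cycling the samples through this loop until no mistake occurs in a full pass recovers the convergence behavior discussed next.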
We will also consider the more tractable hinge loss
    V(f_i(x), y) = |1 − y(ω_i · x)|_+.

Update rule: if E_i = E(f_{i−1}(x_i), y_i) = 0 then ω_i = ω_{i−1}; otherwise
    ω_i = ω_{i−1} + y_i x_i.

The Perceptron Algorithm (cont.)
The update rule follows a passive-aggressive strategy: if f_{i−1} classifies x_i correctly, don't move; if f_{i−1} classifies x_i incorrectly, try to increase the margin y_i(ω · x_i). In fact,
    y_i(ω_i · x_i) = y_i(ω_{i−1} · x_i) + y_i²‖x_i‖² > y_i(ω_{i−1} · x_i).

Perceptron Convergence Theorem*
Theorem: If the samples z_1, ..., z_n are linearly separable, then, presenting them cyclically to the Perceptron algorithm, the sequence of weight vectors ω_i eventually converges.

We will prove a more general result encompassing both the separable and the inseparable cases.

* Pattern Classification. Duda, Hart, and Stork, 2001.

Mistake Bound*
Theorem: Assume ‖x_i‖ ≤ R for every i = 1, 2, ..., n. Then for every u ∈ ℝ^d,
    M ≤ ( R‖u‖ + (Σ_{i=1}^n V̂_i²)^{1/2} )²,
where V̂_i = V(u · x_i, y_i) and M is the total number of mistakes:
    M = Σ_{i=1}^n E_i = Σ_{i=1}^n E(f_{i−1}(x_i), y_i).

* Online Passive-Aggressive Algorithms. Crammer et al., 2003.

Mistake Bound (cont.)
• the boundedness condition ‖x_i‖ ≤ R is necessary.
• in the separable case, there exists u* inducing margins y_i(u* · x_i) ≥ 1, and therefore null "batch" loss over the sample points. The mistake bound becomes
    M ≤ R²‖u*‖².
• in the inseparable case, we can let u be the best possible linear separator. The bound compares the online performance with the best batch performance over a given class of competitors.

Proof
The terms ω_i · u increase as i increases:
1. if E_i = 0, then ω_i · u = ω_{i−1} · u;
2. if E_i = 1, since V̂_i = |1 − y_i(x_i · u)|_+,
    ω_i · u = ω_{i−1} · u + y_i(x_i · u) ≥ ω_{i−1} · u + 1 − V̂_i;
3. hence, in both cases, ω_i · u ≥ ω_{i−1} · u + (1 − V̂_i)E_i;
4. summing up, ω_n · u ≥ M − Σ_{i=1}^n V̂_i E_i.

Proof (cont.)
The terms ‖ω_i‖ do not increase too quickly:
1. if E_i = 0, then ‖ω_i‖² = ‖ω_{i−1}‖²;
2. if E_i = 1, since y_i(ω_{i−1} · x_i) ≤ 0,
    ‖ω_i‖² = (ω_{i−1} + y_i x_i) · (ω_{i−1} + y_i x_i) = ‖ω_{i−1}‖² + ‖x_i‖² + 2y_i(ω_{i−1} · x_i) ≤ ‖ω_{i−1}‖² + R²;
3. summing up, ‖ω_n‖² ≤ MR².

Proof (cont.)
Using the estimates for ω_n · u and ‖ω_n‖², and applying the Cauchy-Schwarz inequality:
1. by Cauchy-Schwarz, ω_n · u ≤ ‖ω_n‖‖u‖, hence
    M − Σ_{i=1}^n V̂_i E_i ≤ ω_n · u ≤ ‖ω_n‖‖u‖ ≤ √M R‖u‖;
2. finally, by Cauchy-Schwarz (and since E_i² = E_i),
    Σ_{i=1}^n V̂_i E_i ≤ (Σ_{i=1}^n V̂_i²)^{1/2} (Σ_{i=1}^n E_i²)^{1/2} = √M (Σ_{i=1}^n V̂_i²)^{1/2},
   hence, dividing by √M,
    √M − (Σ_{i=1}^n V̂_i²)^{1/2} ≤ R‖u‖.

The Experts Framework
We will focus on the classification case. Suppose we have a pool of prediction strategies, called experts, denoted by E = {E_1, ..., E_n}. Each expert predicts y_i based on x_i. We want to combine these experts to produce a single master algorithm for classification, and to prove bounds on how much worse it is than the best expert.

The Halving Algorithm*
Suppose all the experts are functions (their predictions for a point in the space do not change over time) and at least one of them is consistent with the data. At each step, predict what the majority of the experts still consistent with the past data predict; once the true label is revealed, discard the experts that predicted incorrectly.
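A minimal sketch of the standard Halving scheme: predict the majority vote of the experts still consistent with the data seen so far, then drop every expert that just erred. The tie-breaking toward +1 and the list-of-functions representation of experts are implementation assumptions, not part of the original notes.

```python
def halving_predict(consistent_experts, x):
    # Majority vote of the surviving experts; labels are in {-1, +1}.
    votes = sum(e(x) for e in consistent_experts)
    return 1 if votes >= 0 else -1       # assumed tie-break toward +1

def halving(experts, samples):
    """Halving algorithm: keep only the experts consistent with the data,
    predict their majority vote. If some expert never errs, each mistake of
    the master eliminates at least half of the surviving experts, so the
    master makes at most log2(len(experts)) mistakes."""
    alive = list(experts)
    mistakes = 0
    for x, y in samples:
        if halving_predict(alive, x) != y:
            mistakes += 1
        alive = [e for e in alive if e(x) == y]  # discard inconsistent experts
    return mistakes, alive
```

For instance, with threshold experts e_t(x) = sign(x − t) and labels generated by one of them, the master's mistake count stays within the logarithmic bound above.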