CS281B/Stat241B (Spring 2008)   Statistical Learning Theory   Lecture 22: Online Convex Optimization
Lecturer: Alexander Rakhlin   Scribe: Jiening Zhan

1 Recap: Prediction with Expert Advice

In the last lecture, the following online prediction algorithm based on expert advice was presented. Given a convex loss function l(·):

For t = 1, ..., T
    player observes f_{1,t}, ..., f_{N,t}
    player predicts p_t
    adversary reveals outcome y_t
    player suffers loss l(p_t, y_t)
    experts suffer loss l(f_{i,t}, y_t)
End

The goal is to minimize the regret,

    R_T = \sum_{t=1}^{T} l(p_t, y_t) - \min_{i \in [N]} \sum_{t=1}^{T} l(f_{i,t}, y_t)    (1)

By using exponential weights w_{i,t} = e^{-\eta L_{i,t}} and predictions p_t = \frac{\sum_i w_{i,t} f_{i,t}}{\sum_i w_{i,t}}, an upper bound on the regret was found to be R_T \le \sqrt{\frac{T}{2} \log N}.

Based on this algorithm, the online linear optimization and online convex optimization algorithms are derived.

2 Online Linear Optimization

Let \Delta_N denote the N-dimensional simplex.

For t = 1, ..., T
    player predicts w_t \in \Delta_N (w_t is essentially a probability distribution)
    adversary reveals l_t \in R^N, where l_t(i) = l(f_{i,t}, y_t)
    player suffers loss w_t \cdot l_t
End

The regret is defined as

    R_T = \sum_{t=1}^{T} w_t \cdot l_t - \min_{w^* \in \Delta_N} \sum_{t=1}^{T} w^* \cdot l_t    (2)

Note that for the simplex, the minimizing distribution w^* places all the probability on the best expert.

Figure 1: Online Linear Optimization (K is the simplex)

3 Online Convex Optimization

Let K be a convex region. Also, for all t \in [T], let l_t : R^N \to R be convex.

For t = 1, ..., T
    player predicts w_t \in K
    adversary reveals l_t(\cdot)
    player suffers loss l_t(w_t)
End

The regret is

    R_T = \sum_{t=1}^{T} l_t(w_t) - \min_{w^* \in K} \sum_{t=1}^{T} l_t(w^*)    (3)

Figure 3 gives an example of the online convex optimization setting. In this setting, the adversary is quite 'handicapped': the worst the adversary can do is change the loss function, as demonstrated in Figure 3.
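As a concrete illustration of the recap in Section 1, the exponential-weights update can be sketched in a few lines. This is not code from the notes; the function name and the loop structure are illustrative, and for simplicity it returns the weight vectors w_t on the simplex (the online-linear-optimization view of Section 2) rather than the combined prediction p_t.

```python
import math

def exponential_weights(expert_losses, eta):
    """Exponential-weights forecaster: w_{i,t} is proportional to exp(-eta * L_{i,t-1}),
    where L_{i,t-1} is expert i's cumulative loss before round t.

    expert_losses: length-T list of rounds, each a length-N list of losses l_t(i).
    Returns the length-T list of weight vectors w_t, each lying on the simplex.
    """
    N = len(expert_losses[0])
    cum = [0.0] * N                          # cumulative losses L_{i,t-1}
    weights = []
    for round_losses in expert_losses:
        raw = [math.exp(-eta * L) for L in cum]
        Z = sum(raw)                         # normalization constant
        weights.append([r / Z for r in raw])
        cum = [L + l for L, l in zip(cum, round_losses)]
    return weights
```

With losses in [0, 1] and \eta = \sqrt{8 \log N / T}, the standard analysis yields the bound R_T \le \sqrt{(T/2) \log N} quoted above.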
However, we can always choose a w^* such that the adversary cannot produce too much loss.

Figure 2: Online Convex Optimization

Figure 3: The adversary is quite handicapped in this setting

4 Algorithm: Online Gradient Descent (Zinkevich, 2003)

For t = 1, ..., T
    predict w_t
    observe l_t(\cdot)
    update w_{t+1} = \Pi_K(w_t - \eta \nabla l_t(w_t)), where \Pi_K(\cdot) is the Euclidean projection onto the set K
End

Theorem 4.1. Let G = \max_{t \in [T]} \|\nabla l_t(w_t)\| and let D be the diameter of K. The online gradient descent algorithm attains regret R_T \le GD\sqrt{T}.

Proof. Let \bar{w}_{t+1} = w_t - \eta \nabla l_t(w_t), let w^* = \arg\min_{w \in K} \sum_{t=1}^{T} l_t(w), and write \nabla_t = \nabla l_t(w_t). Since projecting onto K can only decrease the distance to w^* \in K,

    \|w_{t+1} - w^*\|^2 \le \|\bar{w}_{t+1} - w^*\|^2                                          (4)
                         = \|w_t - \eta \nabla_t - w^*\|^2                                      (5)
                         = \|w_t - w^*\|^2 + \eta^2 \|\nabla_t\|^2 - 2\eta \nabla_t \cdot (w_t - w^*)   (6)

Rearranging terms, it can be seen that

    \nabla_t \cdot (w_t - w^*) \le \frac{\|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2}{2\eta} + \frac{\eta}{2} \|\nabla_t\|^2    (7)

For a convex loss function l_t,

    l_t(w_t) - l_t(w^*) \le \nabla_t \cdot (w_t - w^*)    (8)

Summing over t = 1, ..., T, the distance terms in (7) telescope:

    \sum_{t=1}^{T} (l_t(w_t) - l_t(w^*)) \le \sum_{t=1}^{T} \nabla_t \cdot (w_t - w^*)          (9)
        \le \frac{\|w_1 - w^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\nabla_t\|^2        (10)
        \le \frac{D^2}{2\eta} + \frac{\eta}{2} T G^2                                             (11)

Setting \eta = \frac{D}{G\sqrt{T}}, we get R_T \le GD\sqrt{T}.

Let K be the simplex of dimension N. It follows that D = 1 and G \le \sqrt{N}. Therefore, the regret of online gradient descent is R_T \le \sqrt{TN}. Compared to the regret bound from prediction with exponential weights, R_T \le \sqrt{\frac{T}{2} \log N}, this bound scales much more quickly in N.

The online gradient descent algorithm also behaves differently from the prediction with exponential weights algorithm. Assume that we are choosing weights from a three-dimensional simplex, and at time t we choose weight w_t. If there is no loss, l_t = (0, 0, 0), then for both algorithms w_{t+1} = w_t; that is, our position does not change. However, if l_t = (1, 0, 0), then at time t + 1 online gradient descent moves a distance of \eta, while exponential weights moves a distance that is exponential in the loss. This is demonstrated in Figure 4.

Figure 4: Exponential weights algorithm vs. online gradient descent

5 Bregman Divergence

Suppose R : R^N \to R is strictly convex with continuous first-order partial derivatives.

Definition.
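The online gradient descent update of Section 4 can be sketched for the special case K = \Delta_N with linear losses l_t(w) = w \cdot l_t, so that \nabla l_t(w_t) = l_t. The helper names are illustrative, the projection uses the standard sort-based routine (which is not spelled out in the notes), and the check below takes the simplex diameter as D = \sqrt{2}.

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    css, rho, rho_css = 0.0, 0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        if ui - (css - 1.0) / i > 0:       # largest index kept positive after shifting
            rho, rho_css = i, css
    theta = (rho_css - 1.0) / rho          # shift so the kept coordinates sum to 1
    return [max(x - theta, 0.0) for x in v]

def ogd(losses, eta):
    """Online gradient descent over the simplex with linear losses l_t(w) = w . l_t.
    Returns the played iterates w_1, ..., w_T."""
    N = len(losses[0])
    w = [1.0 / N] * N                      # start at the uniform distribution
    iterates = []
    for l in losses:
        iterates.append(w)
        w = project_simplex([wi - eta * li for wi, li in zip(w, l)])
    return iterates
```

With \eta = D / (G\sqrt{T}) as in the proof of Theorem 4.1, the realized regret stays below GD\sqrt{T}.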
The Bregman divergence between x and y with respect to R is

    D_R(x, y) = R(x) - R(y) - \nabla R(y) \cdot (x - y)    (12)

Note that the Bregman divergence is in general not symmetric.

Example.

    R(x) = \frac{1}{2}\|x\|^2  \Rightarrow  D_R(x, y) = \frac{1}{2}\|x - y\|^2    (13)

Example. The unnormalized negative entropy yields the generalized KL divergence:

    R(x) = \sum_{i=1}^{N} (x_i \log x_i - x_i)  \Rightarrow  D_R(x, y) = KL(x, y) = \sum_{i=1}^{N} x_i \log \frac{x_i}{y_i} + \sum_{i=1}^{N} (y_i - x_i)    (14)

Property.

    D_{A+B}(x, y) = D_A(x, y) + D_B(x, y)    (15)

Property (three-point identity).

    D_R(u, v) + D_R(v, w) = D_R(u, w) + (u - v) \cdot (\nabla R(w) - \nabla R(v))    (16)

Property. The Bregman projection onto a convex set K exists and is unique. Let w' be the Bregman projection of the point w onto the convex set K:

    w' = \arg\min_{v \in K} D_R(v, w)    (17)

Property (Generalized Pythagorean Theorem). For all u \in K,

    D_R(u, w) \ge D_R(u, w') + D_R(w', w)
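Both examples and the three-point identity (16) are easy to check numerically. A minimal sketch with hypothetical helper names, using the fact that for the negative-entropy R the gradient is (\nabla R(x))_i = \log x_i:

```python
import math

def grad_negentropy(x):
    # gradient of R(x) = sum_i (x_i log x_i - x_i): each coordinate is log x_i
    return [math.log(xi) for xi in x]

def bregman_negentropy(x, y):
    # D_R(x, y) for the negative-entropy R: the generalized KL divergence (14)
    return sum(xi * math.log(xi / yi) + (yi - xi) for xi, yi in zip(x, y))

def bregman_euclidean(x, y):
    # D_R(x, y) for R(x) = 0.5 ||x||^2 is half the squared distance, as in (13)
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
```

Plugging any three positive vectors u, v, w into (16) with these helpers confirms the identity to machine precision.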