CS281B/Stat241B (Spring 2008)   Statistical Learning Theory   Lecture: 21

Online Learning: Halving Algorithm and Exponential Weights

Lecturer: Sasha Rakhlin   Scribe: Ariel Kleiner

This lecture introduces online learning, in which we largely eschew statistical assumptions and instead consider the behavior of individual sequences of observations and predictions. See http://seed.ucsd.edu/~mindreader for a demonstration.

In general, we will think of an algorithm as a "player" and a source of data as an "adversary."

1 Halving Algorithm

Suppose that we (the player) have access to the predictions of $N$ "experts." Denote these predictions by $f_{1,t}, \ldots, f_{N,t} \in \{0,1\}$.

At each $t = 1, \ldots, T$, we observe $f_{1,t}, \ldots, f_{N,t}$ and predict $p_t \in \{0,1\}$. We then observe $y_t \in \{0,1\}$ and suffer loss $\mathbf{1}(p_t \neq y_t)$. Suppose $\exists j$ such that $f_{j,t} = y_t$ for all $t \in [T]$.

Halving Algorithm: predict $p_t = \mathrm{majority}(C_t)$, where $C_1 = [N]$ and $C_t \subseteq [N]$ is defined below for $t > 1$. (Short implementation sketches of both algorithms in this lecture appear at the end of these notes.)

Theorem 1.1. If $p_t = \mathrm{majority}(C_t)$ and
\[
C_{t+1} = \{i \in C_t : f_{i,t} = y_t\},
\]
then we will make at most $\log_2 N$ mistakes.

Proof. For every $t$ at which there is a mistake, at least half of the experts in $C_t$ are wrong, and so
\[
|C_{t+1}| \leq \frac{|C_t|}{2}.
\]
It follows immediately that
\[
|C_{T+1}| \leq \frac{|C_1|}{2^M},
\]
where $M$ is the total number of mistakes. Additionally, because there is a perfect expert, $|C_{T+1}| \geq 1$. As a result, recalling that $C_1 = [N]$,
\[
1 \leq \frac{N}{2^M}
\]
and, rearranging,
\[
M \leq \log_2 N.
\]

2 Exponential Weights or Weighted Majority

We now change our assumptions about the game. For $t = 1, \ldots, T$, the player observes
\[
f_{1,t}, \ldots, f_{N,t} \in [0,1]
\]
and predicts $p_t \in [0,1]$. The outcome $y_t \in [0,1]$ is then revealed, and the player suffers loss $\ell(p_t, y_t)$; the experts suffer losses $\ell(f_{i,t}, y_t)$, $\forall i$. We assume that the loss function $\ell : [0,1] \times [0,1] \to [0,1]$ is convex in its first argument. Our goal is to achieve low regret $R_T$, defined as
\[
R_T = \underbrace{\sum_{t=1}^T \ell(p_t, y_t)}_{L_T} - \min_{i \in [N]} \underbrace{\sum_{t=1}^T \ell(f_{i,t}, y_t)}_{L_{i,T}}.
\]

Exponential Weights (or Weighted Majority) Algorithm: Maintain an (unnormalized) distribution over $[N]$ given by the weights
\[
w_{i,t} = e^{-\eta L_{i,t-1}}
\]
and predict
\[
p_t = \frac{\sum_{i=1}^N w_{i,t} f_{i,t}}{\sum_{i=1}^N w_{i,t}}.
\]
Note that the weights can be defined equivalently by letting $w_{i,1} = 1$ and
\[
w_{i,t+1} = w_{i,t} \, e^{-\eta \ell(f_{i,t}, y_t)}.
\]

Theorem 2.1. With an appropriate choice of $\eta$, $R_T = O(\sqrt{T})$. In fact, with $\eta = \sqrt{\frac{8 \ln N}{T}}$,
\[
R_T \leq \sqrt{\frac{T}{2} \ln N}.
\]

Proof. Define $W_t = \sum_{i=1}^N w_{i,t}$. Recall that, by definition, $w_{i,1} = 1$, $\forall i$, and so $W_1 = N$. Now,
\begin{align}
\ln \frac{W_{T+1}}{W_1} &= \ln \sum_{i=1}^N w_{i,T+1} - \ln N \nonumber \\
&= \ln \sum_{i=1}^N e^{-\eta L_{i,T}} - \ln N \nonumber \\
&\geq \ln \left( \max_{i=1,\ldots,N} e^{-\eta L_{i,T}} \right) - \ln N \nonumber \\
&= -\eta \min_{i=1,\ldots,N} L_{i,T} - \ln N. \tag{1}
\end{align}
Additionally,
\begin{align}
\ln \frac{W_{t+1}}{W_t} &= \ln \frac{\sum_{i=1}^N w_{i,t+1}}{\sum_{i=1}^N w_{i,t}} = \ln \frac{\sum_{i=1}^N e^{-\eta \ell(f_{i,t}, y_t)} w_{i,t}}{\sum_{i=1}^N w_{i,t}} \nonumber \\
&\leq -\eta \, \frac{\sum_{i=1}^N \ell(f_{i,t}, y_t) \, w_{i,t}}{\sum_{i=1}^N w_{i,t}} + \frac{\eta^2}{8} \tag{2} \\
&\leq -\eta \, \ell(p_t, y_t) + \frac{\eta^2}{8}. \tag{3}
\end{align}
Inequality (2) holds because of Hoeffding's lemma:
\[
\ln \mathbb{E}\, e^{sX} \leq s \, \mathbb{E} X + \frac{s^2 (b-a)^2}{8}
\]
for any random variable $X \in [a,b]$ and any $s \in \mathbb{R}$. The role of $X$ in (2) above is played by $\ell(f_{i,t}, y_t)$, with $i$ drawn according to the distribution proportional to the weights $w_{i,t}$, and the role of $s$ is played by $-\eta$. Inequality (3) follows from Jensen's inequality because $\ell$ is convex in its first argument.

Using (3), we find that
\[
\ln \frac{W_{T+1}}{W_1} = \ln \frac{W_{T+1}}{W_T} + \ln \frac{W_T}{W_{T-1}} + \cdots + \ln \frac{W_2}{W_1} \leq -\eta \sum_{t=1}^T \ell(p_t, y_t) + T \, \frac{\eta^2}{8}.
\]
Therefore, combining this inequality with the lower bound (1) obtained above, we have
\[
-\eta \min_{i=1,\ldots,N} L_{i,T} - \ln N \leq -\eta L_T + T \, \frac{\eta^2}{8}
\]
and so, rearranging,
\[
L_T \leq \min_{i=1,\ldots,N} L_{i,T} + \frac{\ln N}{\eta} + \frac{T \eta}{8}.
\]
Finally, we optimize over $\eta$ (i.e., minimize the last two terms with respect to $\eta$): setting the derivative of $\frac{\ln N}{\eta} + \frac{T\eta}{8}$ to zero gives $\eta = \sqrt{\frac{8 \ln N}{T}}$, at which the two terms are equal and sum to $\sqrt{\frac{T}{2} \ln N}$. We thus obtain the claimed bound
\[
R_T \leq \sqrt{\frac{T}{2} \ln N}.
\]
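The halving analysis translates directly into a few lines of code. The following Python sketch is not part of the original notes; the interface and names are illustrative. It maintains the consistent set $C_t$, predicts by majority vote, and assumes, as in Theorem 1.1, that at least one expert is perfect.

\begin{verbatim}
def halving(expert_preds, outcomes):
    """Halving algorithm (sketch).

    expert_preds[t][i] in {0, 1} is f_{i,t}; outcomes[t] in {0, 1} is y_t.
    Assumes at least one expert is perfect. Returns the number of mistakes.
    """
    N = len(expert_preds[0])
    C = set(range(N))                        # C_1 = [N]: all experts start consistent
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        ones = sum(preds[i] for i in C)
        p = 1 if 2 * ones >= len(C) else 0   # majority vote over C_t (ties -> 1)
        if p != y:
            mistakes += 1                    # each mistake at least halves |C|
        C = {i for i in C if preds[i] == y}  # C_{t+1}: keep only consistent experts
    return mistakes
\end{verbatim}

For example, with $N = 8$ experts, one of them perfect, this loop makes at most $\log_2 8 = 3$ mistakes on any outcome sequence.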
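Similarly, here is a minimal sketch of exponential weights, again illustrative rather than from the notes. Storing log-weights and shifting by their maximum is an implementation choice for numerical stability; it is mathematically equivalent to maintaining $w_{i,t} = e^{-\eta L_{i,t-1}}$ directly.

\begin{verbatim}
import math

def exponential_weights(expert_preds, outcomes, loss, eta):
    """Exponential weights / weighted majority (sketch).

    expert_preds[t][i] in [0, 1] is f_{i,t}; outcomes[t] in [0, 1] is y_t;
    loss(p, y) is convex in p with values in [0, 1]; eta > 0, e.g.
    eta = math.sqrt(8 * math.log(N) / T) as in Theorem 2.1.
    Returns the player's cumulative loss L_T.
    """
    N = len(expert_preds[0])
    log_w = [0.0] * N                  # log w_{i,1} = 0, i.e. w_{i,1} = 1
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        m = max(log_w)                 # shift by the max for numerical stability
        w = [math.exp(lw - m) for lw in log_w]
        W = sum(w)
        p = sum(wi * fi for wi, fi in zip(w, preds)) / W   # weighted average
        total_loss += loss(p, y)
        for i in range(N):             # w_{i,t+1} = w_{i,t} exp(-eta l(f_{i,t}, y_t))
            log_w[i] -= eta * loss(preds[i], y)
    return total_loss
\end{verbatim}

For instance, with the absolute loss \verb|loss = lambda p, y: abs(p - y)| (convex in its first argument and bounded in $[0,1]$) and $\eta = \sqrt{8 \ln N / T}$, Theorem 2.1 guarantees that the returned cumulative loss exceeds the best expert's loss by at most $\sqrt{\frac{T}{2} \ln N}$.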