CS260: Machine Learning Theory
Lecture 11: Follow the Regularized Leader
October 31, 2011
Lecturer: Jennifer Wortman Vaughan

(All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.)

1 Last Time...

In the last class, we introduced the expert advice framework.

Learning from Expert Advice
At each round $t \in \{1, 2, \dots, T\}$,
• The learner chooses a distribution $\vec{p}_t$.
• Each expert $i \in \{1, \dots, n\}$ suffers loss $\ell_{i,t} \in [0, 1]$.
• The learner suffers expected loss $\vec{p}_t \cdot \vec{\ell}_t$.

The regret of the learning algorithm is then defined to be
\[
  \sum_{t=1}^{T} \vec{p}_t \cdot \vec{\ell}_t \;-\; \min_{i \in \{1,\dots,n\}} \sum_{t=1}^{T} \ell_{i,t}.
\]

We first discussed the Randomized Weighted Majority algorithm with parameter $\eta$. At each round $t$, RWM chooses $\vec{p}_t$ by setting
\[
  p_{i,t} = \frac{e^{-\eta L_{i,t-1}}}{\sum_{j=1}^{n} e^{-\eta L_{j,t-1}}}
\]
for all experts $i \in \{1, \dots, n\}$, where $L_{i,t-1} = \sum_{s=1}^{t-1} \ell_{i,s}$ denotes the cumulative loss of expert $i$ through round $t-1$. We also introduced the class of Follow the Regularized Leader (FTRL) algorithms, which use weights of the form
\[
  \vec{p}_t = \arg\min_{\vec{p} \in \Delta_n} \left( \eta \sum_{s=1}^{t-1} \vec{\ell}_s \cdot \vec{p} + R(\vec{p}) \right)
\]
where $R(\cdot)$ is a convex function called the regularizer, and $\eta > 0$ is a parameter that allows us to adjust the relative impact of the two terms.

Finally, we mentioned a useful fact called Jensen's inequality, which comes up surprisingly often in machine learning.

Theorem 1 (Jensen's Inequality). For any convex function $f$ and any random variable $X$, $f(E[X]) \le E[f(X)]$. Conversely, for any concave function $f$ and any random variable $X$, $E[f(X)] \le f(E[X])$.

One trick for remembering which way the inequalities go is to keep in mind the following pictures.

[Figure: (a) a convex function $f$, with $f(E[x])$ marked; (b) a concave function $f$, with $f(E[x])$ and $E[f(x)]$ marked for points $x_1, x_2$.]

2 Weighted Majority and Entropy

We can prove that RWM is a Follow the Regularized Leader algorithm with
\[
  R(\vec{p}) = -H(\vec{p}) = -\sum_{i=1}^{n} p_i \log \frac{1}{p_i}.
\]
To show this, it is sufficient to show that the distribution $\vec{p}_t$ chosen by RWM at time $t$ is the distribution $\vec{p}$ that minimizes
\[
  \eta \sum_{s=1}^{t-1} \vec{\ell}_s \cdot \vec{p} - H(\vec{p}). \tag{1}
\]

First note that for any $\vec{p} \in \Delta_n$,
\[
  \eta \sum_{s=1}^{t-1} \vec{\ell}_s \cdot \vec{p} - H(\vec{p})
    = \eta \vec{L}_{t-1} \cdot \vec{p} - H(\vec{p})
    = \eta \sum_{i=1}^{n} L_{i,t-1} p_i - \sum_{i=1}^{n} p_i \log \frac{1}{p_i}
    = \sum_{i=1}^{n} p_i \left( \eta L_{i,t-1} - \log \frac{1}{p_i} \right)
    = -\sum_{i=1}^{n} p_i \log \frac{e^{-\eta L_{i,t-1}}}{p_i}. \tag{2}
\]
By Jensen's inequality (applied to the concave function $\log$), we then have that for any $\vec{p} \in \Delta_n$,
\[
  \eta \sum_{s=1}^{t-1} \vec{\ell}_s \cdot \vec{p} - H(\vec{p})
    = -\sum_{i=1}^{n} p_i \log \frac{e^{-\eta L_{i,t-1}}}{p_i}
    \ge -\log \left( \sum_{i=1}^{n} p_i \frac{e^{-\eta L_{i,t-1}}}{p_i} \right)
    = -\log \left( \sum_{i=1}^{n} e^{-\eta L_{i,t-1}} \right) \tag{3}
\]
and so this is a lower bound on the quantity that the FTRL algorithm will minimize.

Plugging the RWM distribution $\vec{p}_t$ into Equation 2, and using the fact that $e^{-\eta L_{i,t-1}} / p_{i,t} = \sum_{j=1}^{n} e^{-\eta L_{j,t-1}}$ by the definition of RWM, we get
\[
  \eta \sum_{s=1}^{t-1} \vec{\ell}_s \cdot \vec{p}_t - H(\vec{p}_t)
    = -\sum_{i=1}^{n} p_{i,t} \log \frac{e^{-\eta L_{i,t-1}}}{p_{i,t}}
    = -\sum_{i=1}^{n} p_{i,t} \log \sum_{j=1}^{n} e^{-\eta L_{j,t-1}}
    = -\log \left( \sum_{i=1}^{n} e^{-\eta L_{i,t-1}} \right). \tag{4}
\]
Since the final expression in Equation 4 is equal to the final expression in Equation 3, it must be the case that the Randomized Weighted Majority algorithm chooses the distribution that minimizes Equation 1.
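As a quick sanity check on this argument, the short Python sketch below compares the value of the objective in Equation 1 at the RWM distribution against its value at a large sample of distributions drawn uniformly from the simplex. The number of experts, the cumulative losses, and the value of $\eta$ are arbitrary illustrative choices, as is the helper name objective.

# Sanity check: the RWM distribution minimizes  eta * (L . p) - H(p)  over the simplex.
import numpy as np

rng = np.random.default_rng(0)
n, eta = 6, 0.5
L = rng.uniform(0.0, 5.0, size=n)      # cumulative losses L_{i,t-1}, illustrative values

def objective(p, L, eta):
    """eta * (L . p) - H(p), written as eta * (L . p) + sum_i p_i log p_i (valid for p > 0)."""
    return eta * np.dot(L, p) + np.sum(p * np.log(p))

# RWM / entropy-regularized FTRL distribution: p_i proportional to exp(-eta * L_i).
p_rwm = np.exp(-eta * L)
p_rwm /= p_rwm.sum()

value_at_rwm = objective(p_rwm, L, eta)
values_at_random = [objective(rng.dirichlet(np.ones(n)), L, eta) for _ in range(10000)]

# Equation 4 says the minimum value is -log(sum_i exp(-eta * L_i)).
print("value at RWM distribution :", value_at_rwm)
print("-log sum_i exp(-eta L_i)  :", -np.log(np.sum(np.exp(-eta * L))))
print("best value over samples   :", min(values_at_random))
assert value_at_rwm <= min(values_at_random) + 1e-12

On any run, the value at the RWM distribution should match $-\log \sum_i e^{-\eta L_{i,t-1}}$ from Equation 4 and should never exceed the value at any sampled distribution, matching the lower bound in Equation 3.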
3 Regret Bounds for Follow the Regularized Leader

We will now prove a regret bound that holds for the class of Follow the Regularized Leader algorithms. We start with a useful lemma.

3.1 The Advantage of Knowing the Future

We first prove a lemma that can be viewed as a regret bound for a hypothetical algorithm that chooses the distribution $\vec{p}$ that minimizes
\[
  \eta \sum_{s=1}^{t} \vec{\ell}_s \cdot \vec{p} + R(\vec{p})
\]
at each time $t$; that is, a hypothetical algorithm that uses the distribution $\vec{p}_{t+1}$ instead of $\vec{p}_t$ at time $t$. Note that it is not actually possible to run such an algorithm since $\vec{\ell}_t$ is not known to the algorithm at the time when $\vec{p}_t$ is chosen. However, we will be able to use this bound to derive the regret bound for FTRL.

Lemma 1 (Be-the-Regularized-Leader Lemma). Let $\vec{p}_t$ be the distribution chosen by Follow the Regularized Leader at time $t$. For any $\vec{p} \in \Delta_n$ and any $\eta > 0$,
\[
  \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_{t+1} - \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p} \le \frac{1}{\eta} \left( R(\vec{p}) - R(\vec{p}_1) \right). \tag{5}
\]

Proof: The proof is by induction on $T$.

Consider the base case of $T = 0$. The left hand side of the inequality is 0. By definition of the algorithm, we know that
\[
  \vec{p}_1 = \arg\min_{\vec{p} \in \Delta_n} \left( \eta \cdot 0 + R(\vec{p}) \right) = \arg\min_{\vec{p} \in \Delta_n} R(\vec{p}).
\]
Therefore, for any $\vec{p}$,
\[
  \text{RHS} = \frac{1}{\eta} \left( R(\vec{p}) - R(\vec{p}_1) \right) \ge 0 = \text{LHS},
\]
and so our base case holds.

Now suppose that the inequality holds for every value up to $T - 1$. Then for all $\vec{p} \in \Delta_n$,
\[
  \sum_{t=1}^{T-1} \vec{\ell}_t \cdot \vec{p}_{t+1} - \sum_{t=1}^{T-1} \vec{\ell}_t \cdot \vec{p} \le \frac{1}{\eta} \left( R(\vec{p}) - R(\vec{p}_1) \right),
\]
and so
\[
  \sum_{t=1}^{T-1} \vec{\ell}_t \cdot \vec{p}_{t+1} + \frac{1}{\eta} R(\vec{p}_1) \le \sum_{t=1}^{T-1} \vec{\ell}_t \cdot \vec{p} + \frac{1}{\eta} R(\vec{p}).
\]
Since this inequality holds for any distribution $\vec{p}$, we can plug in $\vec{p} = \vec{p}_{T+1}$ and get
\[
  \sum_{t=1}^{T-1} \vec{\ell}_t \cdot \vec{p}_{t+1} + \frac{1}{\eta} R(\vec{p}_1) \le \sum_{t=1}^{T-1} \vec{\ell}_t \cdot \vec{p}_{T+1} + \frac{1}{\eta} R(\vec{p}_{T+1}).
\]
Adding $\vec{\ell}_T \cdot \vec{p}_{T+1}$ to both sides, we get
\[
  \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_{t+1} + \frac{1}{\eta} R(\vec{p}_1) \le \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_{T+1} + \frac{1}{\eta} R(\vec{p}_{T+1}).
\]
By definition, $\vec{p}_{T+1}$ is the distribution $\vec{p}$ that minimizes the right hand side of this inequality, so any other $\vec{p}$ can only increase its value. Therefore, for any $\vec{p} \in \Delta_n$,
\[
  \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_{t+1} + \frac{1}{\eta} R(\vec{p}_1) \le \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p} + \frac{1}{\eta} R(\vec{p}).
\]
Rearranging terms, we see that Equation 5 holds for $T$, proving the lemma.

3.2 The Regret Bound for Follow the Regularized Leader

Using this lemma, we can bound the regret of FTRL. Rearranging Equation 5, we get that for any $\vec{p} \in \Delta_n$,
\[
  -\sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p} \le -\sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_{t+1} + \frac{1}{\eta} \left( R(\vec{p}) - R(\vec{p}_1) \right).
\]
Adding $\sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t$ to both sides yields
\[
  \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t - \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}
    \le \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t - \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_{t+1} + \frac{1}{\eta} \left( R(\vec{p}) - R(\vec{p}_1) \right)
    = \sum_{t=1}^{T} \left( \vec{\ell}_t \cdot \vec{p}_t - \vec{\ell}_t \cdot \vec{p}_{t+1} \right) + \frac{1}{\eta} \left( R(\vec{p}) - R(\vec{p}_1) \right). \tag{6}
\]
The left hand side of this inequality is the regret of FTRL with respect to the fixed distribution $\vec{p}$. Since this holds for any $\vec{p}$, it holds in particular for the $\vec{p}$ with minimal cumulative loss, which we know will put all of its weight on a single expert. Therefore, this gives us a regret bound if we can bound the terms on the right. We have proved the following theorem.

Theorem 2. For any sequence of losses $\vec{\ell}_1, \dots, \vec{\ell}_T$, let $\vec{p}_1, \dots, \vec{p}_T$ be the distributions chosen by Follow the Regularized Leader with parameter $\eta$ and regularizer $R$. Then
\[
  \sum_{t=1}^{T} \vec{\ell}_t \cdot \vec{p}_t - \min_{i \in \{1,\dots,n\}} \sum_{t=1}^{T} \ell_{i,t}
    \le \sum_{t=1}^{T} \left( \vec{\ell}_t \cdot \vec{p}_t - \vec{\ell}_t \cdot \vec{p}_{t+1} \right) + \frac{1}{\eta} \left( \max_{\vec{p} \in \Delta_n} R(\vec{p}) - \min_{\vec{p} \in \Delta_n} R(\vec{p}) \right).
\]

Both terms in this bound depend on the choice of regularizer. The first term measures how quickly the algorithm's distribution changes from one time step to the next, that is, how stable the algorithm is. (Remember, as we saw with the Follow the Leader algorithm, instability is bad in this setting.) The ...
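To see how the two terms of the bound behave, the following Python sketch runs FTRL with the negative-entropy regularizer (that is, RWM) on a sequence of uniformly random losses and compares the realized regret against the right-hand side of Theorem 2; for $R(\vec{p}) = -H(\vec{p})$ the regularizer term $\max R - \min R$ equals $\log n$, since $-H$ ranges from $-\log n$ at the uniform distribution to $0$ at a point mass. The horizon $T$, the number of experts $n$, the value of $\eta$, and the random losses are arbitrary illustrative choices.

# Simulation: FTRL with R(p) = -H(p) (i.e., RWM), checked against the bound of Theorem 2.
import numpy as np

rng = np.random.default_rng(1)
n, T, eta = 10, 500, 0.1
losses = rng.uniform(0.0, 1.0, size=(T, n))   # loss vectors ell_t with entries in [0, 1]

def ftrl_entropy(L_cum, eta):
    """FTRL distribution for R(p) = -H(p): p_i proportional to exp(-eta * L_i)."""
    w = np.exp(-eta * (L_cum - L_cum.min()))   # shift for numerical stability
    return w / w.sum()

L_cum = np.zeros(n)
alg_loss, stability_term = 0.0, 0.0
for t in range(T):
    p_t = ftrl_entropy(L_cum, eta)             # distribution played at round t
    alg_loss += losses[t] @ p_t
    L_cum += losses[t]
    p_next = ftrl_entropy(L_cum, eta)          # the "be-the-leader" distribution p_{t+1}
    stability_term += losses[t] @ (p_t - p_next)

regret = alg_loss - L_cum.min()                # regret against the best single expert
bound = stability_term + np.log(n) / eta       # Theorem 2 with max R - min R = log n
print(f"regret = {regret:.3f},  bound = {bound:.3f}")
assert regret <= bound + 1e-9

The regularizer term here is $\log(n)/\eta$, so a smaller $\eta$ makes the distributions change more slowly (shrinking the stability term) at the price of a larger regularizer term; the choice of $\eta$ trades these two terms off against each other.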

