CS269: Machine Learning Theory
Lecture 10: Weighted Majority and Online Convex Optimization
October 27, 2010
Lecturer: Jake Abernethy
Scribe: Karan Chaudhry and Sendie Hudaya

1 Randomized Weighted Majority

In the last lecture, we introduced the Randomized Weighted Majority algorithm and began to prove a regret bound. Recall that Randomized Weighted Majority assigns an exponential weight $w_t(i)$ to each expert $i$ at time $t$, defined as follows:

    $w_t(i) = e^{-\eta L_{t-1}(i)}$

where $\eta$ is a learning parameter, $l_t \in [0,1]^n$ is the vector of expert losses at time $t$, and $L_t$ is the vector of cumulative losses,

    $L_t = \sum_{s=1}^{t} l_s.$

We initialize each $w_1(i)$ to 1, and for each round $t$ the weight $w_t(i)$ is updated according to the learning parameter $\eta$ and the loss at time $t-1$. We wish to decrease the weight of an expert if it has high loss, and increase the weight if its loss is low. Also recall that in the Randomized Weighted Majority algorithm, the player chooses a prediction at random among the experts. The probability of choosing an expert is simply that expert's weight normalized by the sum of all expert weights:

    $p_t(i) = \frac{w_t(i)}{W}, \qquad W = \sum_{i=1}^{n} w_t(i).$

In the last lecture, we started to prove the following regret bound for Randomized Weighted Majority.

Theorem 1. Let $L^*$ be a known upper bound on the loss of the best performing expert, and assume $L^* \ge 2 \log n$. The regret of Randomized Weighted Majority run with parameter $\eta = \sqrt{2 \log n / L^*}$ is no more than $2\sqrt{2 L^* \log n}$.

Note that we can set $L^* = T$ if no other information is known.[1] This bound implies that the regret of Randomized Weighted Majority is $O(\sqrt{T \log n})$. Last time we introduced two tricks that will help us prove the bound on regret:

Fact 1. $\log(1+x) \le x$ for all $x$. (Note this is equivalent to $1 + x \le e^x$ from Lecture 2.)

Fact 2. $e^{\alpha x} - 1 \le (e^{\alpha} - 1)\,x$ for all $\alpha$ and all $x \in [0,1]$.

Let us now complete the proof of the regret bound for Randomized Weighted Majority.

Proof: By the end of last lecture, we arrived at the following inequality, where $i^*$ is the "best" expert, that is, the expert with the lowest cumulative loss:

    $\sum_{t=1}^{T} p_t \cdot l_t \le \frac{1}{1 - e^{-\eta}} \left( \log\Big(\sum_{i=1}^{n} w_1(i)\Big) - \log\Big(\sum_{i=1}^{n} w_{T+1}(i)\Big) \right).$

Using this, we can bound the cumulative loss of the algorithm as follows:[2]

    $\sum_{t=1}^{T} p_t \cdot l_t \le \frac{1}{1 - e^{-\eta}} \left( \log\Big(\sum_{i=1}^{n} w_1(i)\Big) - \log\Big(\sum_{i=1}^{n} w_{T+1}(i)\Big) \right)$
    $\qquad = \frac{1}{1 - e^{-\eta}} \left( \log(n) - \log\Big[\sum_{i=1}^{n} e^{-\eta L_T(i)}\Big] \right)$    (1)
    $\qquad \le \frac{1}{1 - e^{-\eta}} \left( \log(n) - \log e^{-\eta L_T(i^*)} \right)$
    $\qquad = \frac{1}{1 - e^{-\eta}} \left( \log(n) + \eta L_T(i^*) \right)$
    $\qquad \le \frac{1}{1 - e^{-\log(1+\eta)}} \left( \log(n) + \eta L_T(i^*) \right)$    (2)
    $\qquad = \frac{1}{1 - \frac{1}{1+\eta}} \left( \log(n) + \eta L_T(i^*) \right)$
    $\qquad = \frac{1+\eta}{\eta} \left( \log(n) + \eta L_T(i^*) \right).$

Remember that $w_1(i)$ is initialized to 1 for all $i$, so $\sum_{i=1}^{n} w_1(i) = n$; together with $w_{T+1}(i) = e^{-\eta L_T(i)}$, this gives Equation 1. The inequality that follows Equation 1 holds because the sum of the expert weights is at least the weight of any single expert, so $\sum_{i=1}^{n} e^{-\eta L_T(i)} \ge e^{-\eta L_T(i^*)}$. Equation 2 is true by Fact 1, since $\log(1+\eta) \le \eta$ implies $e^{-\eta} \le e^{-\log(1+\eta)}$.

Assume that $\eta \le 1$. (This will be true by the assumptions we have made in the theorem statement.) In this case, the above gives us that

    $\sum_{t=1}^{T} p_t \cdot l_t \le L_T(i^*) + \eta L_T(i^*) + \frac{2}{\eta} \log(n)$

and so

    $\sum_{t=1}^{T} p_t \cdot l_t - L_T(i^*) \le \eta L_T(i^*) + \frac{2}{\eta} \log(n) \le \eta L^* + \frac{2}{\eta} \log(n).$

This holds for any value of $\eta \in (0, 1]$. It is minimized when we set $\eta = \sqrt{2 \log n / L^*}$, so we will use this value of $\eta$. The bound above then becomes

    $\sum_{t=1}^{T} p_t \cdot l_t - L_T(i^*) \le 2\sqrt{2 L^* \log n}.$

[1] Of course we still need to know $T$ to do this. There are tricks for achieving low regret when $T$ is not known in advance, including the common "doubling trick", but we won't get to those here.
[2] Note that this bound is presented slightly differently than it was in class. This presentation is a bit simpler, and doesn't require introducing an additional parameter $\beta$.
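To make the algorithm above concrete, here is a minimal sketch of Randomized Weighted Majority in Python. It is not part of the original notes: the function name rwm, the loss-matrix input format, and the use of NumPy are illustrative choices; only the update rule, the sampling distribution, and the setting $\eta = \sqrt{2 \log n / L^*}$ come from the text.

```python
import numpy as np

def rwm(losses, L_star=None, seed=0):
    """Sketch of Randomized Weighted Majority (names and shapes are illustrative).

    losses : array of shape (T, n); losses[t, i] in [0, 1] is expert i's loss at round t.
    L_star : known upper bound on the best expert's cumulative loss (defaults to T).
    Returns (expected_loss, realized_loss): sum_t p_t . l_t, and the total loss of
    the experts actually sampled.
    """
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    L_star = T if L_star is None else L_star
    eta = np.sqrt(2.0 * np.log(n) / L_star)   # eta = sqrt(2 log n / L*)

    cum = np.zeros(n)                          # L_{t-1}(i): cumulative expert losses so far
    expected_loss = realized_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * cum)                 # w_t(i) = exp(-eta * L_{t-1}(i))
        p = w / w.sum()                        # p_t(i) = w_t(i) / W
        i = rng.choice(n, p=p)                 # the randomized prediction for this round
        expected_loss += p @ losses[t]         # the quantity bounded in Theorem 1
        realized_loss += losses[t, i]
        cum += losses[t]                       # update cumulative losses
    return expected_loss, realized_loss
```

On any loss sequence with the stated assumptions, expected_loss minus the best expert's cumulative loss should stay below $2\sqrt{2 L^* \log n}$, matching Theorem 1.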
1.1 Using a Prior Over the Experts

Alternatively, we can change the initial weights to be non-uniform. Let $w_1 \in \Delta_n$ be the initial weights. Then $w_t(i) = e^{-\eta L_{t-1}(i)} w_1(i)$. The calculation for the new total loss becomes:

    $\frac{1}{1 - e^{-\eta}} \left( \log(1) - \log\Big(\sum_{i=1}^{n} w_{T+1}(i)\Big) \right) = \frac{1}{1 - e^{-\eta}} \left( 0 - \log\Big[\sum_{i=1}^{n} e^{-\eta L_T(i)} w_1(i)\Big] \right) \le \frac{1}{1 - e^{-\eta}} \left( \log\frac{1}{w_1(i^*)} + \eta L_T(i^*) \right).$

Now the new bound on regret becomes:

    $2\sqrt{2 L^* \log\frac{1}{w_1(i^*)}}.$

This bound is better if the best expert has a high initial weight.

2 Online Convex Optimization

We now consider a generalization of the expert advice setting. A convex programming problem consists of a convex feasible set $K \subseteq \mathbb{R}^n$ and a convex cost function $f : K \to \mathbb{R}$. In online convex optimization, an algorithm faces a sequence of convex programming problems, each with the same feasible set but a different cost function. Each time, the algorithm must choose a point before it observes the cost function. This is a generalization both of earlier work on minimizing error online and of the experts problem.

In the experts problem, one has $n$ experts, each of which has a loss in $[0, 1]$ at each round. At each round, the algorithm selects a probability distribution over experts. The set of all probability distributions is a convex set. Also, the cost function on this set is linear, and therefore convex.

The online convex optimization problem can be formulated as a repeated game between a player and an adversary. At round $t$, the player chooses an action $w_t$ from some convex subset $K$ of $\mathbb{R}^n$, and the adversary chooses a convex loss function $f_t$. The player's goal is to ensure that the total loss, $\sum_{t=1}^{T} f_t(w_t)$, does not differ by too much from the smallest total loss $\sum_{t=1}^{T} f_t(w)$ over any fixed action $w$. The difference between the total loss and its optimal value for a fixed action is termed regret, and denoted as

    $R_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in K} \sum_{t=1}^{T} f_t(w).$

Recall the following definitions, which we will use in our analysis.

Definition 1. In Euclidean space, a set $K$ is said to be convex if, for all $x, y \in K$, $(1-\alpha)x + \alpha y \in K$ for any $\alpha \in [0, 1]$.

Definition 2. A function $f$ is convex if, for all $x, y \in K$, $\alpha f(x) + (1-\alpha) f(y) \ge f(\alpha x + (1-\alpha) y)$ for any $\alpha \in [0, 1]$. Geometrically, this means that if you draw a line segment between the points $(x, f(x))$ and $(y, f(y))$, the function lies below this line segment.

Fact 3. For a convex differentiable function $f$, $f(x) - f(y) \le \nabla f(x) \cdot (x - y)$. Geometrically, this means that if you draw the tangent plane to the function at the point $x$, the function lies above it.

Recall also that $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b$.

2.1 Online Gradient Descent (Zinkevich '03)

We now introduce and analyze the Online Gradient Descent algorithm for online convex optimization. The algorithm is as follows (a small code sketch appears further below):

1. Arbitrarily set $w_1$
2. for $t = 1$ to $T$ do
3.     Choose $w_t$, observe $f_t$
4.     Update $\tilde{w}_{t+1} = w_t - \eta \nabla f_t(w_t)$, where $\eta > 0$ is a learning parameter
5.     Set $w_{t+1}$ = Euclidean projection of $\tilde{w}_{t+1}$ onto the convex set $K$
6. end for

We set
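As a concrete illustration of the Online Gradient Descent loop listed above, here is a minimal Python sketch. It is not part of the original notes: the function names, the choice of $K$ as a Euclidean ball of radius $R$ (which has a simple closed-form projection), and the example losses are assumptions made for illustration; any convex set with a computable projection would do.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto K = {w : ||w|| <= radius} (an illustrative choice of K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ogd(grad_f, dim, T, eta, radius=1.0):
    """Sketch of the Online Gradient Descent loop (Zinkevich '03).

    grad_f(t, w) should return the gradient of the adversary's loss f_t at w,
    revealed only after w_t has been played.
    """
    w = np.zeros(dim)                            # step 1: w_1 chosen arbitrarily (here, the origin)
    plays = []
    for t in range(T):                           # step 2
        plays.append(w.copy())                   # step 3: choose w_t, then observe f_t
        g = grad_f(t, w)                         # gradient of f_t at w_t
        w_tilde = w - eta * g                    # step 4: unconstrained gradient step
        w = project_onto_ball(w_tilde, radius)   # step 5: project back onto K
    return plays

# Hypothetical usage: the adversary plays quadratic losses f_t(w) = ||w - z_t||^2 / 2,
# whose gradient is w - z_t; the targets z_t are arbitrary illustrative data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.uniform(-1, 1, size=(100, 3))
    played = ogd(lambda t, w: w - targets[t], dim=3, T=100, eta=0.1)
```

The projection step keeps every iterate feasible; for a different $K$ (for instance, the probability simplex in the experts setting) one would swap in the corresponding projection.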

