CS269: Machine Learning Theory
Lecture 10: Weighted Majority and Online Convex Optimization
October 27, 2010
Lecturer: Jake Abernethy
Scribe: Karan Chaudhry and Sendie Hudaya

1 Randomized Weighted Majority

In the last lecture, we introduced the Randomized Weighted Majority algorithm and began to prove a regret bound. Recall that Randomized Weighted Majority assigns an exponential weight $w_t(i)$ to each expert $i$ at time $t$, defined as follows:

    $w_t(i) = e^{-\eta L_{t-1}(i)}$

where $\eta$ is a learning parameter, $l_t \in [0,1]^n$ is the vector of expert losses at time $t$, and $L_t$ is the vector of cumulative losses,

    $L_t = \sum_{s=1}^{t} l_s.$

We initialize each $w_1(i)$ to 1, and for each round $t$ the weight $w_t(i)$ is updated according to the learning parameter $\eta$ and the loss at time $t-1$. We wish to decrease the weight of an expert if it has high loss, and increase the weight if its loss is low. Also recall that in the Randomized Weighted Majority algorithm, the player chooses a prediction at random among the experts. The probability of choosing an expert is simply that expert's weight normalized by the sum of all expert weights:

    $p_t(i) = \frac{w_t(i)}{W}, \qquad W = \sum_{i=1}^{n} w_t(i).$

In the last lecture, we started to prove the following regret bound for Randomized Weighted Majority.

Theorem 1. Let $L^*$ be a known upper bound on the loss of the best performing expert, and assume $L^* \ge 2 \log n$. The regret of Randomized Weighted Majority run with parameter $\eta = \sqrt{2 \log n / L^*}$ is no more than $2\sqrt{2 L^* \log n}$.

Note that we can set $L^* = T$ if no other information is known.[1] This bound implies that the regret of Randomized Weighted Majority is $O(\sqrt{T \log n})$. Last time we introduced two tricks that will help us prove the bound on regret:

Fact 1. $\log(1+x) \le x$ for all $x$. (Note this is equivalent to $1 + x \le e^x$ from Lecture 2.)

Fact 2. $e^{\alpha x} - 1 \le (e^{\alpha} - 1)\,x$ for all $\alpha$ and all $x \in [0,1]$.

Let us now complete the proof of the regret bound for Randomized Weighted Majority.

Proof: By the end of last lecture, we arrived at the following inequality, where $i^*$ is the "best" expert, that is, the expert with the lowest cumulative loss:

    $\sum_{t=1}^{T} p_t \cdot l_t \le \frac{1}{1 - e^{-\eta}} \left( \log\Big(\sum_{i=1}^{n} w_1(i)\Big) - \log\Big(\sum_{i=1}^{n} w_{T+1}(i)\Big) \right).$

Using this, we can bound the cumulative loss of the algorithm as follows:[2]

    $\sum_{t=1}^{T} p_t \cdot l_t \le \frac{1}{1 - e^{-\eta}} \left( \log\Big(\sum_{i=1}^{n} w_1(i)\Big) - \log\Big(\sum_{i=1}^{n} w_{T+1}(i)\Big) \right)$
    $\qquad = \frac{1}{1 - e^{-\eta}} \left( \log(n) - \log\Big[\sum_{i=1}^{n} e^{-\eta L_T(i)}\Big] \right)$    (1)
    $\qquad \le \frac{1}{1 - e^{-\eta}} \left( \log(n) - \log e^{-\eta L_T(i^*)} \right)$
    $\qquad = \frac{1}{1 - e^{-\eta}} \left( \log(n) + \eta L_T(i^*) \right)$
    $\qquad \le \frac{1}{1 - e^{-\log(1+\eta)}} \left( \log(n) + \eta L_T(i^*) \right)$    (2)
    $\qquad = \frac{1}{1 - \frac{1}{1+\eta}} \left( \log(n) + \eta L_T(i^*) \right)$
    $\qquad = \frac{1+\eta}{\eta} \left( \log(n) + \eta L_T(i^*) \right).$

Remember that $w_1(i)$ is initialized to 1 for all $i$, so $\sum_{i=1}^{n} w_1(i) = n$; together with $w_{T+1}(i) = e^{-\eta L_T(i)}$, this gives Equation 1. The inequality that follows Equation 1 holds because the sum of the expert weights is at least the weight of any single expert, so $\sum_{i=1}^{n} e^{-\eta L_T(i)} \ge e^{-\eta L_T(i^*)}$. Equation 2 is true by Fact 1, since $\log(1+\eta) \le \eta$ implies $e^{-\eta} \le e^{-\log(1+\eta)}$.

Assume that $\eta \le 1$. (This will be true by the assumptions we have made in the theorem statement.) In this case, the above gives us that

    $\sum_{t=1}^{T} p_t \cdot l_t \le L_T(i^*) + \eta L_T(i^*) + \frac{2}{\eta} \log(n)$

and so

    $\sum_{t=1}^{T} p_t \cdot l_t - L_T(i^*) \le \eta L_T(i^*) + \frac{2}{\eta} \log(n) \le \eta L^* + \frac{2}{\eta} \log(n).$

This holds for any value of $\eta \in (0, 1]$. It is minimized when we set $\eta = \sqrt{2 \log n / L^*}$, so we will use this value of $\eta$. The bound above then becomes

    $\sum_{t=1}^{T} p_t \cdot l_t - L_T(i^*) \le 2\sqrt{2 L^* \log n}.$

[1] Of course we still need to know $T$ to do this. There are tricks for achieving low regret when $T$ is not known in advance, including the common "doubling trick", but we won't get to those here.
[2] Note that this bound is presented slightly differently than it was in class. This presentation is a bit simpler, and doesn't require introducing an additional parameter $\beta$.
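To make the algorithm above concrete, here is a minimal sketch of Randomized Weighted Majority in Python. It is not part of the original notes: the function name rwm, the loss-matrix input format, and the use of NumPy are illustrative choices; only the update rule, the sampling distribution, and the setting $\eta = \sqrt{2 \log n / L^*}$ come from the text.

```python
import numpy as np

def rwm(losses, L_star=None, seed=0):
    """Sketch of Randomized Weighted Majority (names and shapes are illustrative).

    losses : array of shape (T, n); losses[t, i] in [0, 1] is expert i's loss at round t.
    L_star : known upper bound on the best expert's cumulative loss (defaults to T).
    Returns (expected_loss, realized_loss): sum_t p_t . l_t, and the total loss of
    the experts actually sampled.
    """
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    L_star = T if L_star is None else L_star
    eta = np.sqrt(2.0 * np.log(n) / L_star)   # eta = sqrt(2 log n / L*)

    cum = np.zeros(n)                          # L_{t-1}(i): cumulative expert losses so far
    expected_loss = realized_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * cum)                 # w_t(i) = exp(-eta * L_{t-1}(i))
        p = w / w.sum()                        # p_t(i) = w_t(i) / W
        i = rng.choice(n, p=p)                 # the randomized prediction for this round
        expected_loss += p @ losses[t]         # the quantity bounded in Theorem 1
        realized_loss += losses[t, i]
        cum += losses[t]                       # update cumulative losses
    return expected_loss, realized_loss
```

On any loss sequence with the stated assumptions, expected_loss minus the best expert's cumulative loss should stay below $2\sqrt{2 L^* \log n}$, matching Theorem 1.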
1.1 Using a Prior Over the Experts

Alternatively, we can change the initial weights to be non-uniform. Let $w_1 \in \Delta_n$ be the initial weights. Then $w_t(i) = e^{-\eta L_{t-1}(i)} w_1(i)$. The calculation for the new total loss becomes:

    $\frac{1}{1 - e^{-\eta}} \left( \log(1) - \log\Big(\sum_{i=1}^{n} w_{T+1}(i)\Big) \right) = \frac{1}{1 - e^{-\eta}} \left( 0 - \log\Big[\sum_{i=1}^{n} e^{-\eta L_T(i)} w_1(i)\Big] \right) \le \frac{1}{1 - e^{-\eta}} \left( \log\frac{1}{w_1(i^*)} + \eta L_T(i^*) \right).$

Now the new bound on regret becomes:

    $2\sqrt{2 L^* \log\frac{1}{w_1(i^*)}}.$

This bound is better if the best expert has a high initial weight.

2 Online Convex Optimization

We now consider a generalization of the expert advice setting. A convex programming problem consists of a convex feasible set $K \subseteq \mathbb{R}^n$ and a convex cost function $f : K \to \mathbb{R}$. In online convex optimization, an algorithm faces a sequence of convex programming problems, each with the same feasible set but a different cost function. Each time, the algorithm must choose a point before it observes the cost function. This is a generalization both of earlier work on minimizing error online and of the experts problem.

In the experts problem, one has $n$ experts, each of which has a loss in $[0, 1]$ at each round. At each round, the algorithm selects a probability distribution over experts. The set of all probability distributions is a convex set. Also, the cost function on this set is linear, and therefore convex.

The online convex optimization problem can be formulated as a repeated game between a player and an adversary. At round $t$, the player chooses an action $w_t$ from some convex subset $K$ of $\mathbb{R}^n$, and the adversary chooses a convex loss function $f_t$. The player's goal is to ensure that the total loss, $\sum_{t=1}^{T} f_t(w_t)$, does not differ by too much from the smallest total loss $\sum_{t=1}^{T} f_t(w)$ over any fixed action $w$. The difference between the total loss and its optimal value for a fixed action is termed regret, and denoted as

    $R_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in K} \sum_{t=1}^{T} f_t(w).$

Recall the following definitions, which we will use in our analysis.

Definition 1. In Euclidean space, a set $K$ is said to be convex if, for all $x, y \in K$, $(1-\alpha)x + \alpha y \in K$ for any $\alpha \in [0, 1]$.

Definition 2. A function $f$ is convex if, for all $x, y \in K$, $\alpha f(x) + (1-\alpha) f(y) \ge f(\alpha x + (1-\alpha) y)$ for any $\alpha \in [0, 1]$. Geometrically, this means that if you draw a line segment between the points $(x, f(x))$ and $(y, f(y))$, the function lies below this line segment.

Fact 3. For a convex differentiable function $f$, $f(x) - f(y) \le \nabla f(x) \cdot (x - y)$. Geometrically, this means that if you draw the tangent plane to the function at the point $x$, the function lies above it.

Recall also that $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b$.

2.1 Online Gradient Descent (Zinkevich '03)

We now introduce and analyze the Online Gradient Descent algorithm for online convex optimization. The algorithm is as follows (a small code sketch appears further below):

1. Arbitrarily set $w_1$
2. for $t = 1$ to $T$ do
3.     Choose $w_t$, observe $f_t$
4.     Update $\tilde{w}_{t+1} = w_t - \eta \nabla f_t(w_t)$, where $\eta > 0$ is a learning parameter
5.     Set $w_{t+1}$ = Euclidean projection of $\tilde{w}_{t+1}$ onto the convex set $K$
6. end for

We set
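As a concrete illustration of the Online Gradient Descent loop listed above, here is a minimal Python sketch. It is not part of the original notes: the function names, the choice of $K$ as a Euclidean ball of radius $R$ (which has a simple closed-form projection), and the example losses are assumptions made for illustration; any convex set with a computable projection would do.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto K = {w : ||w|| <= radius} (an illustrative choice of K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ogd(grad_f, dim, T, eta, radius=1.0):
    """Sketch of the Online Gradient Descent loop (Zinkevich '03).

    grad_f(t, w) should return the gradient of the adversary's loss f_t at w,
    revealed only after w_t has been played.
    """
    w = np.zeros(dim)                            # step 1: w_1 chosen arbitrarily (here, the origin)
    plays = []
    for t in range(T):                           # step 2
        plays.append(w.copy())                   # step 3: choose w_t, then observe f_t
        g = grad_f(t, w)                         # gradient of f_t at w_t
        w_tilde = w - eta * g                    # step 4: unconstrained gradient step
        w = project_onto_ball(w_tilde, radius)   # step 5: project back onto K
    return plays

# Hypothetical usage: the adversary plays quadratic losses f_t(w) = ||w - z_t||^2 / 2,
# whose gradient is w - z_t; the targets z_t are arbitrary illustrative data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.uniform(-1, 1, size=(100, 3))
    played = ogd(lambda t, w: w - targets[t], dim=3, T=100, eta=0.1)
```

The projection step keeps every iterate feasible; for a different $K$ (for instance, the probability simplex in the experts setting) one would swap in the corresponding projection.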

