9.520 Lecture 09: Online Learning
Lecturer: Sasha Rakhlin

1 Introduction

Most of the course is concerned with the "batch" learning problem. In this lecture, however, we look at a different model, called "online". Let us first compare and contrast the two.

In "batch" learning, an algorithm takes the data (training samples) and returns a hypothesis. We assume a "static" nature of the world: a new example that we will encounter will be similar to the training set. More precisely, we suppose that all the training samples, as well as the test point, are independent and identically distributed. Hence, given the bunch z_1, ..., z_n of training samples drawn from a distribution p, the quality of the learned function f is E_{z \sim p} \ell(f, z), where \ell is some loss function. The questions addressed by statistical learning theory are: how many examples are needed to have small expected error with certain confidence? What is the lowest error that can be achieved under certain conditions on p? Etc.

If the world is not static, it might be difficult to take advantage of large amounts of data. We can no longer rely on statistical assumptions. In fact, we take an even more dramatic view of the world. We suppose that there are no correlations whatsoever between any two days. As there is no stationary distribution responsible for the data, we no longer want to minimize some expected error. All we want is to survive no matter how adversarial the world is. By surviving, we mean that we do not do too badly relative to other "agents" in the world. In fact, the goal is to do not much worse than the best "agent" (this difference is called regret). Note that this does not guarantee that we will be doing well in absolute terms, as the best agent might be quite bad. However, this is the price we have to pay for not having any coherence between today and tomorrow. The goal of "online learning" is, therefore, to do almost as well as the best comparator.

Today we will describe the most famous setting, "prediction with expert advice". We will then try to understand why the algorithm for this setting works by making some abstractions and moving to the Online Convex Optimization framework. Finally, we will unify the two via Regularization and provide some powerful tools for proving regret guarantees.

2 Prediction with Expert Advice

Suppose we have access to the predictions of N experts. Denote these predictions at time t by f_{1,t}, ..., f_{N,t}. Fix a convex loss function \ell. We suppose that \ell(a, b) \in [0, 1] for simplicity.

At each time step t = 1 to T,

• Player observes f_{1,t}, ..., f_{N,t} and predicts p_t
• Outcome y_t is revealed
• Player suffers loss \ell(p_t, y_t) and experts suffer \ell(f_{i,t}, y_t)

Denote the cumulative loss of expert i by L_{i,t} = \sum_{s=1}^{t} \ell(f_{i,s}, y_s) and the cumulative loss of the player by L_t = \sum_{s=1}^{t} \ell(p_s, y_s). The goal of the Player is to minimize the regret

    R_T = \sum_{t=1}^{T} \ell(p_t, y_t) - \min_{i \in 1...N} \sum_{t=1}^{T} \ell(f_{i,t}, y_t) = L_T - \min_i L_{i,T}.

How can we make predictions based on the history so that R_T is small?

Solution (called Exponential Weights or Weighted Majority [4]): keep weights w_{1,t}, ..., w_{N,t} over the experts and predict

    p_t = \frac{\sum_{i=1}^{N} w_{i,t-1} f_{i,t}}{\sum_{i=1}^{N} w_{i,t-1}}.

Once the outcome is revealed and the losses \ell(f_{i,t}, y_t) can be calculated, the weights are updated as

    w_{i,t} = w_{i,t-1} \cdot \exp(-\eta \, \ell(f_{i,t}, y_t)),

where \eta is the learning rate parameter.
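As a concrete illustration of the algorithm just described, here is a minimal Python sketch of Exponential Weights. It is not part of the original notes; the squared loss, the array layout, and the simulated data in the usage example are assumptions made only for concreteness.

import numpy as np

def exponential_weights(expert_preds, outcomes, eta):
    # expert_preds[t, i] holds f_{i,t}; outcomes[t] holds y_t.
    # The loss used here is the squared loss (a - y)^2, which lies in [0, 1]
    # when predictions and outcomes lie in [0, 1].
    T, N = expert_preds.shape
    w = np.ones(N)                          # w_{i,0} = 1 for every expert
    player_loss = 0.0
    expert_loss = np.zeros(N)
    for t in range(T):
        f_t, y_t = expert_preds[t], outcomes[t]
        p_t = np.dot(w, f_t) / w.sum()      # weighted-average prediction p_t
        player_loss += (p_t - y_t) ** 2
        losses = (f_t - y_t) ** 2           # loss of each expert at round t
        expert_loss += losses
        w = w * np.exp(-eta * losses)       # multiplicative weight update
    return player_loss, expert_loss

# Hypothetical usage: random experts predicting a 0/1 sequence.
T, N = 1000, 10
rng = np.random.default_rng(0)
preds = rng.uniform(0.0, 1.0, size=(T, N))
ys = rng.integers(0, 2, size=T).astype(float)
eta = np.sqrt(8 * np.log(N) / T)            # the tuning analyzed in Theorem 2.1 below
L_player, L_experts = exponential_weights(preds, ys, eta)
print("regret:", L_player - L_experts.min())
print("bound :", np.sqrt((T / 2) * np.log(N)))

Note that the prediction at round t uses the weights w_{i,t-1} from the previous round, exactly as in the formula above.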
Theorem 2.1. For the Exponential Weights algorithm with \eta = \sqrt{8 \ln N / T},

    R_T \le \sqrt{(T/2) \ln N}.

Proof (see [3, 2]). Let W_t = \sum_{i=1}^{N} w_{i,t}. Suppose we initialize w_{i,0} = 1 for all experts i. Then \ln W_0 = \ln N. Furthermore,

    \ln \frac{W_T}{W_0} = \ln \sum_{i=1}^{N} w_{i,T} - \ln N
                        = \ln \sum_{i=1}^{N} \exp(-\eta L_{i,T}) - \ln N
                        \ge \ln \Big( \max_{i=1,...,N} \exp(-\eta L_{i,T}) \Big) - \ln N
                        = -\eta \min_{i=1,...,N} L_{i,T} - \ln N.

On the other hand,

    \ln \frac{W_t}{W_{t-1}} = \ln \frac{\sum_{i=1}^{N} w_{i,t}}{\sum_{i=1}^{N} w_{i,t-1}}
                            = \ln \frac{\sum_{i=1}^{N} \exp(-\eta \, \ell(f_{i,t}, y_t)) \exp(-\eta L_{i,t-1})}{\sum_{i=1}^{N} \exp(-\eta L_{i,t-1})}
                            = \ln \frac{\sum_{i=1}^{N} \exp(-\eta \, \ell(f_{i,t}, y_t)) \, w_{i,t-1}}{\sum_{i=1}^{N} w_{i,t-1}}
                            \le -\eta \frac{\sum_{i=1}^{N} \ell(f_{i,t}, y_t) \, w_{i,t-1}}{\sum_{i=1}^{N} w_{i,t-1}} + \frac{\eta^2}{8}
                            \le -\eta \, \ell(p_t, y_t) + \frac{\eta^2}{8},

where the last inequality follows from the definition of p_t and the convexity of \ell (via an application of Jensen's inequality). The next-to-last inequality holds because for a random variable X \in [a, b],

    \ln E e^{sX} \le s E X + \frac{s^2 (b-a)^2}{8}

for any s \in R (Hoeffding's lemma, applied here to the random variable that takes value \ell(f_{i,t}, y_t) with probability proportional to w_{i,t-1}); see [3, 2] for more details.

Summing the last inequality over t = 1, ..., T and observing that the logarithms telescope,

    \ln \frac{W_T}{W_0} \le -\eta \sum_{t=1}^{T} \ell(p_t, y_t) + \eta^2 T / 8.

Combining the upper and lower bounds on \ln(W_T / W_0) and rearranging,

    L_T \le \min_i L_{i,T} + \frac{\ln N}{\eta} + \frac{\eta T}{8}.

Choosing \eta = \sqrt{8 \ln N / T} balances the two terms: each equals \sqrt{(T \ln N)/8}, so their sum is \sqrt{(T/2) \ln N}, which gives the bound.

We have presented the "Prediction with Expert Advice" framework and the Exponential Weights algorithm because they are the most widely known. However, the proof above does not give us much insight into why things work. Let us now go to a somewhat simpler setting by removing the extra layer of combining the experts' predictions to make our own prediction. The following setting is closely related, but simpler to understand.

3 Online Gradient Descent

Consider the following repeated game.

At each time step t = 1 to T,

• Player predicts w_t, a distribution over the N experts
• A vector of losses \ell_t \in R^N is revealed
• Player suffers \ell_t \cdot w_t, the inner product

The goal is to minimize the regret, defined as

    R_T = \sum_{t=1}^{T} \ell_t \cdot w_t - \min_{w^* \in N\text{-simplex}} \sum_{t=1}^{T} \ell_t \cdot w^*.

Since the minimum of a linear function over the simplex occurs at the vertices, we could equivalently write \min_i.

One can see that this game is closely related to the one introduced at the beginning of the lecture. Here the w_t are the normalized versions of the weights kept by the Exponential Weights algorithm. Think of the loss vector \ell_t as the vector of losses \ell(f_{i,t}, y_t).

The repeated game we just defined is called an Online Linear Optimization game. In fact, we can define Online Linear Optimization over any convex set K. In this case

    R_T = \sum_{t=1}^{T} \ell_t \cdot w_t - \min_{w^* \in K} \sum_{t=1}^{T} \ell_t \cdot w^*.

Suppose we simply perform gradient steps w_{t+1} = w_t - \eta \ell_t, followed by a projection onto the set K. It turns out that this is a very good (often optimal) algorithm. Let us prove that this algorithm is good.

Theorem 3.1 (Online Gradient Descent [5]). Suppose the Online Linear Optimization game is played with the update w_{t+1} = w_t - \eta \ell_t, where \eta = T^{-1/2}, followed by the projection onto the set K. Then the regret satisfies

    R_T \le G D \sqrt{T},

where G is the maximum norm of the gradients \ell_t and D is the diameter of K.

Proof. By the definition of the update, and since projecting onto K cannot increase the distance to w^* \in K,

    \|w_{t+1} - w^*\|^2 \le \|w_t - \eta \ell_t - w^*\|^2 = \|w_t - w^*\|^2 - 2\eta \, \ell_t \cdot (w_t - w^*) + \eta^2 \|\ell_t\|^2.
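Here is a minimal Python sketch of the projected gradient update analyzed above. It is not from the notes; the choice of K as a Euclidean ball, the projection routine, and the function names are illustrative assumptions.

def project_to_ball(w, radius):
    # Euclidean projection onto K = {w : ||w|| <= radius}.
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_linear_game(loss_vectors, project, eta):
    # Play the Online Linear Optimization game with the update
    # w_{t+1} = Proj_K(w_t - eta * l_t); loss_vectors[t] is the vector l_t.
    T, N = loss_vectors.shape
    w = project(np.zeros(N))            # start at (the projection of) the origin
    total_loss = 0.0
    for t in range(T):
        l_t = loss_vectors[t]
        total_loss += l_t @ w           # suffer the linear loss l_t . w_t
        w = project(w - eta * l_t)      # gradient step followed by projection
    return total_loss

Only the projection depends on K; for example, passing project=lambda w: project_to_ball(w, 1.0) plays the game over the unit ball, with eta = 1/sqrt(T) as in Theorem 3.1.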


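Specializing the same sketch to the probability simplex recovers a player for the experts game of Section 2. The sort-based simplex projection below and the uniform random losses are again illustrative assumptions, used only to check that the average regret shrinks as T grows.

def project_to_simplex(v):
    # Euclidean projection onto the probability simplex (standard sort-based routine).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

T, N = 10000, 10
rng = np.random.default_rng(1)
losses = rng.uniform(0.0, 1.0, size=(T, N))   # rows are the loss vectors l_t
eta = 1.0 / np.sqrt(T)
player_loss = online_linear_game(losses, project_to_simplex, eta)
best_expert_loss = losses.sum(axis=0).min()
print("average regret:", (player_loss - best_expert_loss) / T)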
