CS269: Machine Learning Theory
Lecture 6: Online Learning
October 13, 2010
Lecturer: Jennifer Wortman Vaughan
Scribe: Shilpa Vijayan, Petch Wannissorn

1 Online Learning Setting

In this class, we shift to a different learning model called the online learning setting. So far we have been considering the batch learning setting, in which an algorithm is first given a set of training data and comes up with a function that performs well on new data drawn from the same distribution. In batch learning the algorithm receives all of its data at once and selects a single function. However, it may be unrealistic to expect to get a representative data set up front and produce a function that performs well on all data that comes thereafter. In the online learning setting, the algorithm learns a function over time and updates it as more data arrives. For example, consider the spam filtering problem. The spam filter updates its prediction function whenever the email user corrects its mistakes, e.g. by explicitly marking an email that got past the filter as spam.

Basic Online Setting

In the basic online setting, the algorithm proceeds in a series of rounds. At each round $t$:

1. The learning algorithm sees a new example $\vec{x}_t$.
2. The algorithm predicts a label $\hat{y}_t$ for this example.
3. After the prediction, the true label $y_t$ is revealed.
4. The algorithm makes a mistake if the predicted label is not equal to the true label, i.e. $y_t \neq \hat{y}_t$.
5. The algorithm updates the function it uses to make predictions.

Unlike in the PAC learning model, no distributional assumptions are made about the $\vec{x}_t$. The online learning setting is adversarial in the sense that no distributional assumptions are being made and there are no restrictions on where the examples come from. We can think of the examples as being generated by an adversary and, because of this, it turns out there are many connections between online learning and game theory.

There are two goals of learning in the online setting:

1. Minimize the number of mistakes. Here we want to bound the total number of mistakes made by the algorithm so that the ratio
$$\frac{\#\text{ of mistakes}}{\#\text{ of rounds}}$$
tends to zero as the number of rounds tends to infinity. This is analogous to the realizable setting, where we assumed there is a perfect target function that labels every example correctly, and the goal was to have the error of our hypothesis go to zero as the number of examples grows.

2. Minimize regret. This is analogous to the unrealizable setting: we minimize the difference between the number of mistakes the algorithm makes and the number of mistakes made by the best predictor, or comparator, in a class of functions. In short, we want the ratio
$$\frac{\#\text{ of mistakes} - \#\text{ of mistakes by comparator}}{\#\text{ of rounds}}$$
to tend to zero as the number of rounds tends to infinity.

The regret-minimization approach is more realistic and useful in real-world applications, where we cannot assume perfection and noise-free data. However, we will concentrate on the first goal, mistake minimization, for today's lecture. To get intuition about the learning model and the types of things that can be learned, we will establish some simple upper and lower bounds and then look at some specific function classes such as linear threshold functions.
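To make the round-by-round protocol concrete, here is a minimal Python sketch of the online loop with mistake counting. The learner interface (a predict method and an update method), the function name run_online, and the example stream are illustrative assumptions rather than anything from the lecture; any online learner can be plugged into this loop.

    def run_online(learner, examples):
        """Run the basic online protocol and count mistakes.

        `examples` is any iterable of (x_t, y_t) pairs; no distributional
        assumptions are made about where the x_t come from.
        """
        mistakes = 0
        for x_t, y_t in examples:
            y_hat = learner.predict(x_t)      # step 2: predict a label
            # step 3: the true label y_t is revealed
            if y_hat != y_t:                  # step 4: count a mistake
                mistakes += 1
            learner.update(x_t, y_t)          # step 5: update the predictor
        return mistakes

Under the mistake-minimization goal, we want the value returned here, divided by the number of rounds, to go to zero as the number of rounds grows.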
2 Upper Bound

We will derive an upper bound that applies to any finite class of functions one is trying to learn. For this purpose, consider the halving algorithm. The halving algorithm is extremely simple, although it suffers from the disadvantages of being computationally infeasible and of assuming the existence of a perfect target, which is sometimes unrealistic. To derive the upper bound we introduce the concept of a version space. We define the version space as the set of all functions in the concept class $C$ that are consistent with all examples seen so far, i.e.
$$\text{Version Space} = \{\text{all consistent functions in } C\}.$$
As the algorithm learns the true labels of more examples, some functions will be dropped from the version space if they become inconsistent.

Halving Algorithm

Theorem 1. Let $C$ be any finite function class. Assume $\exists c \in C$ such that $\forall t$, $y_t = c(\vec{x}_t)$. Then the number of mistakes made by the halving algorithm is no more than $\log |C|$.

We define the halving algorithm as follows. Assume $\exists c \in C$ such that $\forall t$, $y_t = c(\vec{x}_t)$, or in words, assume there exists some perfect target function $c$ in some finite class of functions $C$ such that for all $t$, where $t$ denotes a round, $y_t = c(\vec{x}_t)$. Then:

At each round $t$,

1. Predict the label $\hat{y}_t$ chosen by the majority of functions in the current version space.
2. Compute the new version space.

The analysis of this algorithm is as follows. After zero mistakes, the size of the version space is upper bounded by the size of the class $C$, i.e.
$$|VS| \leq |C|.$$
After 1 mistake, since the majority of the functions in the current version space labeled the point incorrectly, we can eliminate at least half of them. Hence, after eliminating at least half of the functions from the version space ($VS$), we have
$$|VS| \leq \frac{|C|}{2}.$$
After 2 mistakes, having eliminated at least another half of $VS$, we have
$$|VS| \leq \frac{|C|}{4}.$$
More generally, after $k$ mistakes,
$$|VS| \leq \frac{|C|}{2^k}.$$
We assumed that there is at least one perfect function in the class, hence $1 \leq |VS|$. Combining the upper and lower bounds, we have
$$1 \leq |VS| \leq \frac{|C|}{2^k},$$
so $2^k \leq |C|$, and taking logs on both sides gives $k \leq \log |C|$.

The $\log |C|$ dependence of the bound reflects the Occam's razor principle: the bigger the class we start with, the more expensive it is to learn. As an example, consider the class of monotone disjunctions over $n$ features, for which $|C| = 2^n$. The number of mistakes is bounded by $n$ for this class. However, as mentioned above, the algorithm is computationally infeasible, as it requires maintaining a version space of up to $2^n$ functions and computing a majority label by evaluating $2^n$ functions at every step.

3 Lower Bound

We first show the lower bound for deterministic algorithms and see that there is a dependence on VC-dimension in this model as well. Assume that there is a perfect target $c \in C$ and that the VC-dimension of the class is $d$, i.e. $d = VC(C)$. In this case, you can force a deterministic algorithm $A$ to make at least $d$ mistakes. The adversary knows the prediction the algorithm will make each time and has the power to choose the sequence of $\vec{x}_t$ points (as there are no distributional assumptions) and also the labels, so it can always reveal the label opposite to the algorithm's prediction. Since the VC-dimension is $d$, there is a set of $d$ points shattered by $C$, so any such sequence of labels on those points remains consistent with some function in $C$, and the algorithm is forced to make at least $d$ mistakes.
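The adversary argument can be written as a short procedure. The Python sketch below assumes the same predict/update learner interface as the earlier protocol sketch, a deterministic learner, and a list shattered_points of $d$ points shattered by $C$; the names are illustrative, not from the lecture.

    def adversarial_rounds(learner, shattered_points):
        """Force a deterministic learner to err on every round.

        Because the points are shattered by C, every labeling of them is
        realized by some c in C, so revealing the label opposite to each
        prediction keeps the "perfect target exists" assumption intact.
        """
        mistakes = 0
        for x_t in shattered_points:
            y_hat = learner.predict(x_t)   # adversary sees the prediction first
            y_t = 1 - y_hat                # reveal the opposite label
            learner.update(x_t, y_t)
            mistakes += 1                  # one forced mistake per round
        return mistakes                    # equals d = VC(C)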

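Returning to the halving algorithm of Section 2, here is a minimal Python sketch that stores the version space explicitly, together with a construction of the monotone disjunction class used in the example. The names HalvingLearner and monotone_disjunctions are illustrative assumptions; the sketch also makes the infeasibility concrete, since memory and per-round prediction cost grow with $|C| = 2^n$ even though the mistake bound is only $n$.

    from itertools import chain, combinations

    class HalvingLearner:
        """Majority vote over an explicitly stored version space."""

        def __init__(self, concept_class):
            # concept_class is a finite list of functions c with c(x) in {0, 1};
            # we assume some c in the class labels every example perfectly
            self.version_space = list(concept_class)

        def predict(self, x):
            # predict the label chosen by the majority of consistent functions
            votes = sum(c(x) for c in self.version_space)
            return 1 if 2 * votes >= len(self.version_space) else 0

        def update(self, x, y):
            # keep only the functions consistent with the revealed label;
            # after every mistake at least half of the version space is removed
            self.version_space = [c for c in self.version_space if c(x) == y]

    def monotone_disjunctions(n):
        """All 2**n monotone disjunctions over n boolean features.

        Each subset S of features defines c_S(x) = 1 iff x[i] == 1 for some
        i in S (the empty disjunction always predicts 0).
        """
        subsets = chain.from_iterable(
            combinations(range(n), k) for k in range(n + 1))
        return [lambda x, S=S: int(any(x[i] for i in S)) for S in subsets]

With the earlier loop, run_online(HalvingLearner(monotone_disjunctions(n)), examples) on any stream labeled by a monotone disjunction would make at most $n$ mistakes, matching the $\log |C| = n$ bound, but it evaluates all $2^n$ functions on every round.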
