CS260: Machine Learning Theory
Lecture 7: Online Classification and Mistake Bounds
October 17, 2011
Lecturer: Jennifer Wortman Vaughan

Note: All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.

1 Online Classification

So far we have been considering the batch learning setting, in which the learning algorithm is presented with a sample of training data and must produce a hypothesis that performs well on new data generated from the same distribution. For the next few weeks, we will shift our attention to the online learning setting, in which the learning algorithm is presented with a sequence of examples over time and must repeatedly update its hypothesis based on these examples. The online learning setting can be used to model applications like spam filtering, in which the algorithm must adapt to feedback.

A Basic Online Classification Model

In the basic online setting, at each round t ∈ {1, 2, 3, ...}:

1. The learner is presented with a new example x_t.
2. The learner must predict a label ŷ_t for this example.
3. After the prediction is made, the true label y_t of the example is revealed.
4. The learner updates its prediction rule based on x_t, ŷ_t, and y_t.

(A code sketch of this interaction loop is given below.)

Unlike the PAC learning model, no distributional assumptions are made about the sequence of examples x_1, x_2, .... The online learning setting is therefore "adversarial" in the sense that we can imagine the examples are generated by an adversary who would like to force our algorithm to make as many mistakes as possible. Because of this, there are a lot of connections between online learning and game theory, some of which we will discuss in upcoming classes.

The learning algorithm is said to make a mistake on any round t at which y_t ≠ ŷ_t. There are several reasonable goals that could be considered in this setting. We will discuss two:

1. Minimize the number of mistakes made by the algorithm. In this case, we would like to find a bound on the total number of mistakes made by the algorithm such that the ratio

       (# of mistakes) / (# of rounds)

   tends to zero as the number of rounds gets large. In order to achieve such a goal, it is necessary to make some assumptions about the way in which the labels y_t are generated. For example, we might assume that there exists a target function c in a class C such that for all t, y_t = c(x_t). This is the analog of the realizable batch learning setting we discussed.

2. Minimize regret. This is the analog of the unrealizable batch learning setting, as we no longer need to assume the existence of a perfect target function. Instead, we minimize the difference between the number of mistakes the algorithm makes and the number of mistakes made by the best predictor, or comparator, in a class of functions. In short, we would like the ratio

       (# of mistakes − # of mistakes by comparator) / (# of rounds)

   to tend to zero as the number of rounds grows.

The regret minimization scenario may be more realistic and useful in real-world applications, in which we do not have the luxury of assuming that a perfect target exists or that our data is free of noise. However, to get started, we will focus on the first goal for the next few lectures.
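To make the protocol concrete, here is a minimal Python sketch of the interaction loop together with the mistake count used in the first goal. The learner object with predict and update methods, and the name run_online_classification, are hypothetical illustrations rather than anything prescribed in these notes; the sketch simply mirrors the four steps above for an arbitrary sequence of labeled examples.

def run_online_classification(learner, rounds):
    """Run the basic online protocol on a sequence of (x_t, y_t) pairs and count mistakes."""
    mistakes = 0
    for x_t, y_t in rounds:               # step 1: a new example x_t is presented
        y_hat = learner.predict(x_t)      # step 2: the learner predicts a label
        if y_hat != y_t:                  # step 3: the true label is revealed;
            mistakes += 1                 #         a mistake is any round with y_hat != y_t
        learner.update(x_t, y_hat, y_t)   # step 4: the learner updates its prediction rule
    return mistakes

Nothing here assumes how the pairs in rounds are generated, which matches the adversarial view of the example sequence.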
To get some intuition about the online classification setting and the classes that can or cannot be learned, we will establish some general upper and lower bounds for the Mistake Bound Model (which we will not define formally, but will discuss only informally; see Avrim Blum's online learning survey, http://www.cs.cmu.edu/~avrim/Papers/survey.pdf, for a more formal treatment). In the next lecture, we will start looking at more interesting algorithms that can be applied to learn specific concept classes like linear threshold functions.

2 A Simple Upper Bound

We begin by deriving an upper bound for the Mistake Bound Model that can be applied to any finite concept class. The learning algorithm we consider, the Halving Algorithm, makes use of the notion of a version space. The version space is defined to be the set of all functions from the class C that are consistent with all of the data that the algorithm has seen so far. Over time, as the algorithm sees more examples and more functions become inconsistent, the version space decreases in size. Notice that this notion of version space is only sensible to consider when we assume that there is a perfect target function in C; otherwise the version space could become empty.

The Halving Algorithm works as follows. At each round t = 1, 2, 3, ...:

1. When a new example x_t arrives, set ŷ_t to be the label chosen for x_t by the majority of functions in the current version space, VS_t.
2. When the true label y_t is revealed, update the version space to VS_{t+1} by removing every function that labels x_t incorrectly.

Note that this algorithm can, in general, be terribly inefficient. As a result, the upper bound given in the following theorem says nothing about what can be learned efficiently. (A code sketch of the Halving Algorithm follows the proof.)

Theorem 1. Let C be any finite concept class. If there exists a function c ∈ C such that for all rounds t, y_t = c(x_t), then the number of mistakes made by the Halving Algorithm is no more than log_2 |C|.

Proof: We will analyze the size of the version space over time as the number of mistakes grows. We start with the simple observation that the version space cannot possibly be bigger than the entire concept class C. That is, |VS_t| ≤ |C| for all t.

If a first mistake is made on some round t_1, it implies that the majority of the functions in VS_{t_1} were incorrect about the label of x_{t_1} and are now inconsistent with the data. Therefore, at least half of the functions in VS_{t_1} will be eliminated, and we know that |VS_t| ≤ |C|/2 for all t > t_1.

If a second mistake is made on some round t_2, then at least half of the remaining functions in the version space will be eliminated. We have that |VS_t| ≤ |C|/4 for all t > t_2.

More generally, if the kth mistake is made on some round t_k, then we know that |VS_t| ≤ |C|/2^k for all t > t_k. Since we are in a setting in which there exists a perfect target function in C, we also know that |VS_t| ≥ 1 for all t.

Combining these expressions, we get that on any round t after k mistakes have been made,

       1 ≤ |VS_t| ≤ |C| / 2^k,

which implies that k ≤ log_2 |C|. If the Halving Algorithm made more than log_2 |C| mistakes, this condition would be violated, so it can make at most log_2 |C| mistakes.

This result has the familiar logarithmic dependence on the size of the concept class,
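As a concrete illustration of the algorithm analyzed above, here is a minimal Python sketch of the Halving Algorithm, assuming the finite concept class is given explicitly as a list of functions mapping examples to labels. The class name HalvingLearner and the predict/update interface (matching the protocol sketch in Section 1) are illustrative choices, not part of the notes; as noted above, maintaining the version space explicitly can be terribly inefficient.

class HalvingLearner:
    """Maintain the version space of a finite concept class explicitly."""

    def __init__(self, concept_class):
        self.version_space = list(concept_class)   # VS_1 = C

    def predict(self, x):
        # Predict the label chosen by the majority of functions in the
        # current version space (ties broken arbitrarily).
        votes = {}
        for h in self.version_space:
            votes[h(x)] = votes.get(h(x), 0) + 1
        return max(votes, key=votes.get)

    def update(self, x, y_hat, y):
        # Once the true label y is revealed, eliminate every function that
        # is now inconsistent with the data, giving VS_{t+1}.
        self.version_space = [h for h in self.version_space if h(x) == y]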

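For a quick sanity check of Theorem 1, the following toy usage example (reusing the run_online_classification and HalvingLearner sketches above) runs the Halving Algorithm on a small, hypothetical class of threshold functions. The mistake count it prints stays below log_2 |C|, as the theorem guarantees for any realizable sequence of examples.

import math

# A hypothetical finite concept class: thresholds on {0, ..., 9}, where
# c_j(x) = 1 if x >= j and 0 otherwise, so |C| = 11 and log2 |C| is about 3.46.
def make_threshold(j):
    return lambda x: 1 if x >= j else 0

concepts = [make_threshold(j) for j in range(11)]
target = make_threshold(7)                        # a perfect target function in C

learner = HalvingLearner(concepts)
stream = [(x, target(x)) for x in range(10)]      # one arbitrary realizable sequence
mistakes = run_online_classification(learner, stream)

print(mistakes, "<=", math.log2(len(concepts)))   # for this sequence: 1 <= 3.459...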
