CS260: Machine Learning Theory
Lecture 8: The Perceptron Algorithm
October 19, 2011
Lecturer: Jennifer Wortman Vaughan

(All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.)

1 Preliminaries

In this lecture, we will analyze the Perceptron algorithm for learning n-dimensional linear threshold functions, or n-dimensional linear separators. (We will use these terms interchangeably.) For today's class, we will use the label set $\{-1, +1\}$ instead of $\{0, 1\}$. This notational change doesn't make any difference in terms of the meaning of the learning problem, but it will make some of our derivations easier. In these notes, we will use bold letters to represent vectors, and $\|\mathbf{w}\|$ denotes the length of the vector $\mathbf{w}$.

We can map any n-dimensional linear separator that passes through the origin to an n-dimensional weight vector $\mathbf{w}$ such that $\mathbf{w} \cdot \mathbf{x} \geq 0$ for all positive points $\mathbf{x}$, and $\mathbf{w} \cdot \mathbf{x} < 0$ for all negative points. This vector $\mathbf{w}$ is any normal vector of the decision boundary. For this lecture, we will restrict our attention to linear separators that pass through the origin. However, this restriction is without loss of generality, as any n-dimensional linear separator can be represented as an (n+1)-dimensional linear separator that passes through the origin by adding a "dummy feature" that is always equal to 1 to each input point $\mathbf{x}$. (Exercise: Work out the details and convince yourself this is true.)

2 The Margin

Suppose that we run a learning algorithm on a data set and it outputs a linear threshold function. Intuitively speaking, we can probably be relatively confident that points that are far from the decision boundary are labeled correctly, while we may be less confident about points that are very close to the decision boundary, since a small change to the boundary would result in different labels for these points. It would be nice if we could find a decision boundary such that no points are too close. We formalize this idea by introducing the notion of a margin.

Definition 1. Given a linear separator represented by its normal vector $\mathbf{w}$, the margin $\gamma$ of a point $\mathbf{x}$ with label $y \in \{-1, +1\}$ is the distance between $\mathbf{x}$ and the decision boundary. That is,
\[
\gamma = y\, \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \mathbf{x}.
\]

Let's verify that this expression does indeed give the distance between $\mathbf{x}$ and the decision boundary. If $\mathbf{x}$ lies on the decision boundary, then $\mathbf{x}$ must be orthogonal to $\mathbf{w}$, and so we get $\gamma = 0$ as desired.

Suppose $\mathbf{x}$ does not lie on the boundary. Let $\mathbf{z}$ be the projection of $\mathbf{x}$ onto the decision boundary, i.e., $\mathbf{z}$ is the closest point to $\mathbf{x}$ that lies on the boundary. Note that $\mathbf{x} - \mathbf{z}$ is parallel to $\mathbf{w}$. If $y = +1$, we have that
\begin{align*}
\mathbf{x} - \mathbf{z} &= \gamma \frac{\mathbf{w}}{\|\mathbf{w}\|} \\
\mathbf{z} &= \mathbf{x} - \gamma \frac{\mathbf{w}}{\|\mathbf{w}\|} \\
\mathbf{z} \cdot \mathbf{w} &= \mathbf{x} \cdot \mathbf{w} - \gamma \frac{\mathbf{w} \cdot \mathbf{w}}{\|\mathbf{w}\|} \\
0 &= \mathbf{x} \cdot \mathbf{w} - \gamma \|\mathbf{w}\| \\
\gamma &= \mathbf{x} \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = y\, \mathbf{x} \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|},
\end{align*}
where the fourth line uses the facts that $\mathbf{z} \cdot \mathbf{w} = 0$ (since $\mathbf{z}$ lies on the boundary) and $\mathbf{w} \cdot \mathbf{w} = \|\mathbf{w}\|^2$. If $y = -1$, the derivation is similar, except we start with
\[
\mathbf{x} - \mathbf{z} = -\gamma \frac{\mathbf{w}}{\|\mathbf{w}\|},
\]
which leads us to
\[
\gamma = -\mathbf{x} \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = y\, \mathbf{x} \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|}.
\]
These ideas are illustrated in the figure below.

[Figure 2: $\gamma_i$ is the distance between the point $\mathbf{x}_i$ and the separator.]
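As a numerical sanity check of Definition 1, here is a minimal Python/NumPy sketch. It is an addition to these notes, not part of them; the function name margin and the example weight vector and points are made up for illustration.

    import numpy as np

    def margin(w, x, y):
        """Signed margin of a labeled point (x, y) with respect to the separator w . x = 0.

        w : nonzero weight vector (a normal vector of the decision boundary)
        x : input point
        y : label in {-1, +1}
        Returns y * (w / ||w||) . x: positive exactly when the label agrees with
        the side of the boundary x falls on, with magnitude equal to the distance
        from x to the boundary.
        """
        return y * np.dot(w / np.linalg.norm(w), x)

    # Made-up example: the boundary is the line x1 + x2 = 0.
    w = np.array([1.0, 1.0])
    print(margin(w, np.array([2.0, 0.0]), +1))    # sqrt(2): correctly labeled, distance sqrt(2)
    print(margin(w, np.array([0.1, -0.2]), +1))   # negative: this point lies on the wrong side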
3 The Perceptron Algorithm

With these ideas in place, we are ready to introduce the Perceptron algorithm.

PERCEPTRON ALGORITHM

Initialize $\mathbf{w}_1 = \mathbf{0}$.
At each round $t \in \{1, 2, \ldots\}$:
• Receive input $\mathbf{x}_t$
• If $\mathbf{w}_t \cdot \mathbf{x}_t \geq 0$, predict +1, else predict −1
• If there is a mistake (i.e., if $y_t(\mathbf{w}_t \cdot \mathbf{x}_t) < 0$), set $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \mathbf{x}_t$, else set $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$.

The general intuition behind the algorithm is that every time it makes a mistake on a positive example, it shifts the weight vector toward the input point, whereas every time it makes a mistake on a negative point, it shifts the weight vector away from that point.
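To make the update rule concrete, here is a minimal Python/NumPy sketch of a single pass of the algorithm above. The sketch and the toy data are not from the notes; the function name perceptron is illustrative. In this sketch a mistake is counted whenever the prediction disagrees with the label, which also covers the boundary case $\mathbf{w}_t \cdot \mathbf{x}_t = 0$.

    import numpy as np

    def perceptron(points, labels):
        """One pass of the Perceptron algorithm over a sequence of labeled points.

        points : list of n-dimensional NumPy arrays (the inputs x_t)
        labels : list of values in {-1, +1} (the labels y_t)
        Returns the final weight vector and the number of mistakes made.
        """
        w = np.zeros_like(points[0], dtype=float)          # w_1 = 0
        mistakes = 0
        for x, y in zip(points, labels):
            prediction = 1 if np.dot(w, x) >= 0 else -1    # predict +1 iff w_t . x_t >= 0
            if prediction != y:                            # mistake: move w toward or away from x
                w = w + y * x
                mistakes += 1
        return w, mistakes

    # Illustrative data, separable through the origin by u = (1, 0):
    xs = [np.array([2.0, 1.0]), np.array([-1.0, 3.0]),
          np.array([0.5, -2.0]), np.array([-3.0, -1.0])]
    ys = [+1, -1, +1, -1]
    print(perceptron(xs, ys))   # final w = [4, -2], 2 mistakes on this data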
Now we prove a mistake bound for the above algorithm. We assume that the perfect target function is represented by a normal vector $\mathbf{u}$ of unit length; this is without loss of generality since normalizing the weight vector doesn't change the decision boundary. However, we also make a stronger assumption that the perfect target function we consider has a margin of at least $\gamma$. (Exercise: Show how an adversary could force any algorithm to make an unbounded number of mistakes if we didn't have a margin assumption like this.)

Theorem 1. Suppose there exists a $\mathbf{u}$ of unit length and values $\gamma > 0$ and $D > 0$ such that for all $t$, $y_t(\mathbf{x}_t \cdot \mathbf{u}) \geq \gamma$ and $\|\mathbf{x}_t\| \leq D$. Then the number of mistakes made by the Perceptron algorithm is no more than $(D/\gamma)^2$.

Let $m(i)$ be the round in which the $i$th mistake is made. Define $m(0) = 0$.

Lemma 1. For all mistakes $k$, $\mathbf{w}_{m(k)+1} \cdot \mathbf{u} \geq k\gamma$.

Proof: We prove this by induction on the number of mistakes $k$. For the base case, $k = 0$, note that since the initial weight vector $\mathbf{w}_1$ is all 0s, we have $\mathbf{w}_{m(0)+1} \cdot \mathbf{u} = \mathbf{w}_1 \cdot \mathbf{u} = 0$.

For the induction hypothesis, assume that the above statement holds for all $k < i$.

For the induction step, consider $\mathbf{w}_{m(i)+1}$. We have
\begin{align*}
\mathbf{w}_{m(i)+1} \cdot \mathbf{u} &= (\mathbf{w}_{m(i)} + y_{m(i)} \mathbf{x}_{m(i)}) \cdot \mathbf{u} \\
&= \mathbf{w}_{m(i)} \cdot \mathbf{u} + y_{m(i)} (\mathbf{x}_{m(i)} \cdot \mathbf{u}).
\end{align*}
The first equality comes from the Perceptron update rule. We did make a mistake on round $m(i)$, so the weights at round $m(i)+1$ can be computed by applying the update rule to the weights at round $m(i)$.

Now, we know that we did not make a mistake between round $m(i-1)+1$ and round $m(i)$. Since the Perceptron only updates weights when there is a mistake, we have $\mathbf{w}_{m(i)} \cdot \mathbf{u} = \mathbf{w}_{m(i-1)+1} \cdot \mathbf{u}$. We also have $y_{m(i)}(\mathbf{x}_{m(i)} \cdot \mathbf{u}) \geq \gamma$, by the margin requirement in the statement of the theorem. Thus, we have
\begin{align*}
\mathbf{w}_{m(i)+1} \cdot \mathbf{u} &\geq \mathbf{w}_{m(i-1)+1} \cdot \mathbf{u} + \gamma \\
&\geq i\gamma,
\end{align*}
where the last inequality follows from the induction hypothesis.

Lemma 2. For all mistakes $k$, $\|\mathbf{w}_{m(k)+1}\|^2 \leq kD^2$.

Proof: We again prove this lemma by induction on the number of mistakes $k$. For the base case, $k = 0$, we have $\|\mathbf{w}_{m(0)+1}\|^2 = \|\mathbf{w}_1\|^2 = 0$.

Now, let us assume that the statement is true for all $k < i$.

For the induction step, note that
\begin{align*}
\|\mathbf{w}_{m(i)+1}\|^2 &= \|\mathbf{w}_{m(i)} + y_{m(i)} \mathbf{x}_{m(i)}\|^2 \\
&= \|\mathbf{w}_{m(i)}\|^2 + \|\mathbf{x}_{m(i)}\|^2 + 2\, y_{m(i)} (\mathbf{x}_{m(i)} \cdot \mathbf{w}_{m(i)}),
\end{align*}
where the first equality holds for the same reason as in Lemma 1 above. Now, as above, we have $\|\mathbf{w}_{m(i)}\|^2 = \|\mathbf{w}_{m(i-1)+1}\|^2$. Further, by the bound on the lengths of the vectors in the theorem statement, we have $\|\mathbf{x}_{m(i)}\|^2 \leq D^2$. For the third term in the expression above, note that since there was a mistake in round $m(i)$, our prediction of the label did not match the correct label. Thus, $y_{m(i)}(\mathbf{x}_{m(i)} \cdot \mathbf{w}_{m(i)}) < 0$. Therefore, we have
\begin{align*}
\|\mathbf{w}_{m(i)+1}\|^2 &\leq \|\mathbf{w}_{m(i-1)+1}\|^2 + D^2 \\
&\leq iD^2.
\end{align*}
Here, the last inequality follows from the induction hypothesis.
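For reference, here is a brief sketch, not part of the text above, of how the two lemmas combine to give the bound claimed in Theorem 1; it uses only the Cauchy–Schwarz inequality and the assumption that $\|\mathbf{u}\| = 1$. For any number of mistakes $k$,
\[
k\gamma \;\leq\; \mathbf{w}_{m(k)+1} \cdot \mathbf{u} \;\leq\; \|\mathbf{w}_{m(k)+1}\|\,\|\mathbf{u}\| \;=\; \|\mathbf{w}_{m(k)+1}\| \;\leq\; \sqrt{k}\,D,
\]
where the first inequality is Lemma 1, the second is Cauchy–Schwarz, and the last is Lemma 2. Rearranging gives $\sqrt{k} \leq D/\gamma$, so the number of mistakes is at most $(D/\gamma)^2$.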

