CS269: Machine Learning Theory
Lecture 7: Perceptron Algorithm
October 18, 2010
Lecturer: Jennifer Wortman Vaughan
Scribe: Shankar Garikapati and Akshay Wadia

In this lecture, we consider the problem of learning the class of linear separators in the online learning framework. Recall from the previous lecture that an n-dimensional linear separator through the origin can be represented by an n-dimensional vector $\mathbf{u}$. For any vector $\mathbf{x}$, the label of $\mathbf{x}$ is $+1$ if $\mathbf{u} \cdot \mathbf{x} \ge 0$, and $-1$ otherwise. In this lecture, we study the Perceptron algorithm and analyze its mistake bound.

NOTATION. Vectors are represented by lower case bold letters like $\mathbf{u}$, $\mathbf{w}$, $\mathbf{x}$, etc., while scalars are normal lower case letters. The inner product between two vectors $\mathbf{u}$ and $\mathbf{x}$ is denoted by $\mathbf{u} \cdot \mathbf{x}$, and $\|\mathbf{u}\|$ denotes the length of the vector $\mathbf{u}$.

1 Perceptron Algorithm

Before describing the Perceptron algorithm, we review the notion of margin. We have the following from the previous lecture.

Definition 1. Given a linear separator $\mathbf{u}$, the margin $\gamma_t$ of $\mathbf{x}_t$ with label $y_t \in \{+1, -1\}$ is the distance of $\mathbf{x}_t$ from the separator. That is,
\[
\gamma_t = y_t(\mathbf{u} \cdot \mathbf{x}_t).
\]

Now we present the Perceptron algorithm.

PERCEPTRON ALGORITHM
1. Initialize $t := 1$ and $\mathbf{w}_1$ to be the all-zeros weight vector.
2. For each $\mathbf{x}_t$, predict $+1$ if $\mathbf{w}_t \cdot \mathbf{x}_t \ge 0$, and $-1$ otherwise.
3. If there is a mistake (i.e., if $y_t(\mathbf{w}_t \cdot \mathbf{x}_t) < 0$), set $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \mathbf{x}_t$. Else, set $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$.
4. $t \leftarrow t + 1$.
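To make the online loop concrete, here is a minimal Python/NumPy sketch. The function name perceptron_online and the (x_t, y_t) streaming interface are our own illustration, not part of the lecture; the logic follows steps 1-4 above, declaring a mistake exactly when $y_t(\mathbf{w}_t \cdot \mathbf{x}_t) < 0$.

```python
import numpy as np

def perceptron_online(stream):
    """Run the (unnormalized) Perceptron on a sequence of (x_t, y_t) pairs,
    where each x_t is a NumPy vector and y_t is in {+1, -1}.
    Returns the final weight vector and the number of mistakes.
    Illustrative sketch of steps 1-4 from the lecture."""
    stream = list(stream)
    w = np.zeros(len(stream[0][0]))   # step 1: w_1 is the all-zeros vector
    mistakes = 0
    for x, y in stream:
        # Step 2 predicts +1 when w_t . x_t >= 0 and -1 otherwise;
        # step 3 updates exactly when y_t (w_t . x_t) < 0.
        if y * (w @ x) < 0:
            w = w + y * x             # w_{t+1} <- w_t + y_t x_t
            mistakes += 1
    return w, mistakes
```

On any sequence that is separable through the origin with margin $\gamma$ and $\|\mathbf{x}_t\| \le D$, Theorem 1 below guarantees that the returned mistake count never exceeds $(D/\gamma)^2$.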
Now we prove the mistake bound for the above algorithm.

Theorem 1. Suppose there exists a $\mathbf{u}$ with $\|\mathbf{u}\| = 1$ and a $\gamma > 0$ such that $y_t(\mathbf{x}_t \cdot \mathbf{u}) \ge \gamma$ for all $t$, and there exists a real number $D$ such that $\|\mathbf{x}_t\| \le D$ for all $t$. Then the number of mistakes made by the (unnormalized) Perceptron is at most $(D/\gamma)^2$.

Proof: Let $m(i)$ be the round in which the $i$th mistake is made, and define $m(0) = 0$. Let $k$ be the total number of mistakes made in the run of the Perceptron algorithm. Finally, let $\mathbf{u}$ be the target separator that we are trying to learn. We prove the theorem through the following two lemmas.

Lemma 1. $\mathbf{w}_{m(k)+1} \cdot \mathbf{u} \ge k\gamma$.

Proof: We prove by induction on the number of mistakes that for all $i$, $\mathbf{w}_{m(i)+1} \cdot \mathbf{u} \ge i\gamma$.

For the base case, note that as the initial weight vector $\mathbf{w}_1$ is all zeros, we have $\mathbf{w}_1 \cdot \mathbf{u} = 0$.

For the induction hypothesis, assume that the statement holds for all values less than $i$.

For the induction step, consider $\mathbf{w}_{m(i)+1}$. We have
\[
\mathbf{w}_{m(i)+1} \cdot \mathbf{u} = (\mathbf{w}_{m(i)} + y_{m(i)} \mathbf{x}_{m(i)}) \cdot \mathbf{u} = \mathbf{w}_{m(i)} \cdot \mathbf{u} + y_{m(i)}(\mathbf{x}_{m(i)} \cdot \mathbf{u}).
\]
The first equality comes from the Perceptron update rule: we did make a mistake on round $m(i)$, so the weights at round $m(i)+1$ can be computed by applying the update rule to the weights at round $m(i)$. Now, we know that we did not make a mistake between round $m(i-1)+1$ and round $m(i)$. Since the Perceptron only updates the weights when there is a mistake, we have $\mathbf{w}_{m(i)} \cdot \mathbf{u} = \mathbf{w}_{m(i-1)+1} \cdot \mathbf{u}$. We also have $y_{m(i)}(\mathbf{x}_{m(i)} \cdot \mathbf{u}) \ge \gamma$ by the margin requirement in the statement of the theorem. Thus, we have
\[
\mathbf{w}_{m(i)+1} \cdot \mathbf{u} \ge \mathbf{w}_{m(i-1)+1} \cdot \mathbf{u} + \gamma \ge i\gamma,
\]
where the last inequality follows from the induction hypothesis.

Lemma 2. $\|\mathbf{w}_{m(k)+1}\|^2 \le kD^2$.

Proof: We prove by induction on the number of mistakes that for all $i$, $\|\mathbf{w}_{m(i)+1}\|^2 \le iD^2$.

For the base case, we have $\|\mathbf{w}_{m(0)+1}\|^2 = \|\mathbf{w}_1\|^2 = 0$.

Assume that the statement is true for all values less than some $i$.

For the induction step, note that
\[
\|\mathbf{w}_{m(i)+1}\|^2 = \|\mathbf{w}_{m(i)} + y_{m(i)} \mathbf{x}_{m(i)}\|^2 = \|\mathbf{w}_{m(i)}\|^2 + \|\mathbf{x}_{m(i)}\|^2 + 2 y_{m(i)}(\mathbf{x}_{m(i)} \cdot \mathbf{w}_{m(i)}),
\]
where the first equality holds for the same reason as in Lemma 1 above, and the second uses $y_{m(i)}^2 = 1$. Now, as above, we have $\|\mathbf{w}_{m(i)}\|^2 = \|\mathbf{w}_{m(i-1)+1}\|^2$. Further, by the bound on the lengths of the vectors in the theorem statement, we have $\|\mathbf{x}_{m(i)}\|^2 \le D^2$. For the third term in the expression above, note that as there was a mistake in round $m(i)$, our prediction of the label did not match the correct label. Thus, $y_{m(i)}(\mathbf{x}_{m(i)} \cdot \mathbf{w}_{m(i)}) < 0$. Therefore, we have
\[
\|\mathbf{w}_{m(i)+1}\|^2 \le \|\mathbf{w}_{m(i-1)+1}\|^2 + D^2 \le iD^2,
\]
where the last inequality follows from the induction hypothesis. This proves the lemma.

To prove Theorem 1, we recall the following simple fact from linear algebra: if $\mathbf{z}$ and $\mathbf{u}$ are two vectors such that $\|\mathbf{u}\| = 1$ and $\theta$ is the angle between $\mathbf{z}$ and $\mathbf{u}$, then
\[
\mathbf{z} \cdot \mathbf{u} = \|\mathbf{z}\| \|\mathbf{u}\| \cos(\theta) \le \|\mathbf{z}\| \|\mathbf{u}\| = \|\mathbf{z}\|.
\]
Putting all of the above together, we have
\[
D\sqrt{k} \ge \|\mathbf{w}_{m(k)+1}\| \ge \mathbf{w}_{m(k)+1} \cdot \mathbf{u} \ge k\gamma.
\]
Thus, $k \le (D/\gamma)^2$. In the above, the first inequality is from Lemma 2, the second from the fact above, and the third from Lemma 1.

2 The “Normalized” Perceptron Algorithm

In this section, we consider a variation of the Perceptron algorithm. This version, called the normalized version, differs from the previous one only in its update rule for the weight vector $\mathbf{w}_t$ when there is a mistake. As all other steps are the same, we simply state the new update rule.

NORMALIZED UPDATE RULE. If there is a mistake, that is, if $y_t(\mathbf{w}_t \cdot \mathbf{x}_t) \le 0$, then set
\[
\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \frac{\mathbf{x}_t}{\|\mathbf{x}_t\|}.
\]
Else, set $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$.

For the normalized Perceptron, we have the following mistake bound.

Theorem 2. Suppose there exists a $\mathbf{u}$ with $\|\mathbf{u}\| = 1$ and a $\gamma_n > 0$ such that for all $t$, $y_t\left(\frac{\mathbf{x}_t}{\|\mathbf{x}_t\|} \cdot \mathbf{u}\right) \ge \gamma_n$. Then the number of mistakes made by the normalized Perceptron is at most $1/\gamma_n^2$.

Proof Sketch: A run of the normalized Perceptron can be thought of as a run of the (unnormalized) Perceptron algorithm with a preprocessing step: we can imagine that before the Perceptron algorithm is run, the vector $\mathbf{x}_t$ is replaced by $\mathbf{x}_t / \|\mathbf{x}_t\|$ for all rounds $t$ (and the margin bound is suitably modified). Now, the behavior of the two algorithms in terms of classification of points is identical: all points are labeled consistently by the two algorithms, and the mistakes also occur at the same points. Thus, we can use Theorem 1 to derive a mistake bound for the normalized Perceptron: in the preprocessed version, as all vectors $\mathbf{x}_t$ are normalized, $D = 1$. This gives the bound $k \le 1/\gamma_n^2$.
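The normalized update rule is a one-line change to the earlier sketch. The following is again our own illustrative sketch rather than code from the lecture; it implements the rule exactly as stated in this section, including the non-strict mistake condition $y_t(\mathbf{w}_t \cdot \mathbf{x}_t) \le 0$.

```python
import numpy as np

def normalized_perceptron_online(stream):
    """Normalized Perceptron on a sequence of (x_t, y_t) pairs with y_t in {+1, -1}.
    Identical to the earlier sketch except for the update:
    on a mistake, add y_t * x_t / ||x_t|| instead of y_t * x_t.
    Illustrative sketch only."""
    stream = list(stream)
    w = np.zeros(len(stream[0][0]))
    mistakes = 0
    for x, y in stream:
        # This section states the mistake condition as y_t (w_t . x_t) <= 0.
        if y * (w @ x) <= 0:
            w = w + y * x / np.linalg.norm(x)   # normalized update
            mistakes += 1
    return w, mistakes
```

Equivalently, as in the proof sketch, one can normalize every $\mathbf{x}_t$ up front and feed the result to the unnormalized loop; since every preprocessed vector has length 1, Theorem 1 applies with $D = 1$ and gives the $1/\gamma_n^2$ bound.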


Advantage of Normalized Perceptron. In this subsection we would like to show that in some situations the normalized Perceptron might perform better (in terms of mistakes made) than the Perceptron algorithm. Note that for the Perceptron algorithm, we do not actually need to know the margin bound $\gamma$ to run the algorithm. Consider a particular run of the algorithm. After the run, define
\[
\gamma := \min_t y_t(\mathbf{x}_t \cdot \mathbf{u}).
\]
This margin bound will be consistent with the run of the algorithm. Similarly, we can retrospectively define the normalized margin bound for the normalized Perceptron. We have
\[
\gamma_n := \min_t y_t\left(\frac{\mathbf{x}_t}{\|\mathbf{x}_t\|} \cdot \mathbf{u}\right) = \min_t y_t(\mathbf{x}_t \cdot \mathbf{u}) \frac{1}{\|\mathbf{x}_t\|} \ge \left(\min_t y_t(\mathbf{x}_t \cdot \mathbf{u})\right)\left(\min_t \frac{1}{\|\mathbf{x}_t\|}\right) \ge \frac{\gamma}{D}.
\]
Thus,
\[
\frac{1}{\gamma_n^2} \le \left(\frac{D}{\gamma}\right)^2.
\]
The above statement in a way shows that although in some cases the normalized margin assumption may seem less natural than the original
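To make the comparison of the two retrospective bounds concrete, here is a small sketch; the helper name and the toy data are hypothetical, chosen so that a single far-away point inflates $D$ (and hence $(D/\gamma)^2$) while barely affecting $\gamma_n$.

```python
import numpy as np

def retrospective_bounds(stream, u):
    """Given a run's examples and a unit-length separator u that classifies
    them all with positive margin, compute the retrospective margins defined
    above and the corresponding mistake bounds. Hypothetical helper."""
    xs = np.array([x for x, _ in stream], dtype=float)
    ys = np.array([y for _, y in stream], dtype=float)
    norms = np.linalg.norm(xs, axis=1)
    gamma = np.min(ys * (xs @ u))             # gamma = min_t y_t (x_t . u)
    gamma_n = np.min(ys * (xs @ u) / norms)   # gamma_n uses x_t / ||x_t||
    D = np.max(norms)
    return (D / gamma) ** 2, 1.0 / gamma_n ** 2

u = np.array([1.0, 0.0])
data = [(np.array([0.5, 0.2]), 1), (np.array([-0.4, 0.3]), -1), (np.array([100.0, 1.0]), 1)]
print(retrospective_bounds(data, u))  # the second (normalized) bound is far smaller here
```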
