6.867 Machine learning, lecture 2 (Jaakkola)

Perceptron, convergence, and generalization

Recall that we are dealing with linear classifiers through the origin, i.e.,

f(x; θ) = sign( θᵀx )    (1)

where θ ∈ R^d specifies the parameters that we have to estimate on the basis of training examples (images) x_1, ..., x_n and labels y_1, ..., y_n ∈ {−1, 1}.

We will use the perceptron algorithm to solve the estimation task. Let k denote the number of parameter updates we have performed and θ(k) the parameter vector after k updates. Initially k = 0 and θ(0) = 0. The algorithm then cycles through all the training instances (x_t, y_t) and updates the parameters only in response to mistakes, i.e., when the label is predicted incorrectly. More precisely, we set θ(k+1) = θ(k) + y_t x_t when y_t (θ(k))ᵀ x_t < 0 (mistake), and otherwise leave the parameters unchanged.

Convergence in a finite number of updates

Let's now show that the perceptron algorithm indeed converges in a finite number of updates. The same analysis will also help us understand how the linear classifier generalizes to unseen images. To this end, we will assume that all the (training) images have bounded Euclidean norms, i.e., ‖x_t‖ ≤ R for all t and some finite R. This is clearly the case for any pixel images with bounded intensity values. We also make a much stronger assumption: there exists a linear classifier in our class, with finite parameter values, that correctly classifies all the (training) images. More precisely, we assume that there is some θ* and some γ > 0 such that y_t (θ*)ᵀ x_t ≥ γ for all t = 1, ..., n. The additional number γ > 0 is used to ensure that each example is classified correctly with a finite margin.

The convergence proof is based on combining two results: 1) we will show that the inner product (θ*)ᵀ θ(k) increases at least linearly with each update, and 2) the squared norm ‖θ(k)‖² increases at most linearly in the number of updates k. By combining the two we can show that the cosine of the angle between θ(k) and θ* has to increase by a finite increment due to each update. Since cosine is bounded by one, it follows that we can only make a finite number of updates.

Part 1: we simply take the inner product (θ*)ᵀ θ(k) before and after each update. When making the kth update, say due to a mistake on image x_t, we get

(θ*)ᵀ θ(k) = (θ*)ᵀ θ(k−1) + y_t (θ*)ᵀ x_t ≥ (θ*)ᵀ θ(k−1) + γ    (2)

since, by assumption, y_t (θ*)ᵀ x_t ≥ γ for all t (θ* is always correct). Thus, after k updates,

(θ*)ᵀ θ(k) ≥ kγ    (3)

Part 2: our second claim follows simply from the fact that updates are made only on mistakes:

‖θ(k)‖² = ‖θ(k−1) + y_t x_t‖²    (4)
        = ‖θ(k−1)‖² + 2 y_t (θ(k−1))ᵀ x_t + ‖x_t‖²    (5)
        ≤ ‖θ(k−1)‖² + ‖x_t‖²    (6)
        ≤ ‖θ(k−1)‖² + R²    (7)

since y_t (θ(k−1))ᵀ x_t < 0 whenever an update is made and, by assumption, ‖x_t‖ ≤ R. Thus,

‖θ(k)‖² ≤ kR²    (8)

We can now combine parts 1) and 2) to bound the cosine of the angle between θ* and θ(k):

cos(θ*, θ(k)) = (θ*)ᵀ θ(k) / ( ‖θ(k)‖ ‖θ*‖ ) ≥ kγ / ( ‖θ(k)‖ ‖θ*‖ ) ≥ kγ / ( √(kR²) ‖θ*‖ )    (9)

where the first inequality follows from part 1) and the second from part 2). Since cosine is bounded by one, we get

1 ≥ kγ / ( √(kR²) ‖θ*‖ ),  or equivalently  k ≤ R² ‖θ*‖² / γ²    (10)
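To make the mistake-driven update rule concrete, here is a minimal sketch of the perceptron algorithm in Python/NumPy. The function name and the data layout are my own choices, not part of the lecture notes, and the sketch treats y_t (θ(k))ᵀ x_t ≤ 0 as a mistake so that the all-zero initial parameter vector also triggers updates, a common convention that differs slightly from the strict inequality in the text.

import numpy as np

def perceptron_through_origin(X, y, max_passes=1000):
    """Perceptron for linear classifiers through the origin.

    X is an (n, d) array of examples, y an (n,) array of +/-1 labels.
    Returns the parameter vector theta and the number of updates k.
    """
    n, d = X.shape
    theta = np.zeros(d)             # theta^(0) = 0
    k = 0                           # number of parameter updates so far
    for _ in range(max_passes):     # cycle through the training set
        made_mistake = False
        for t in range(n):
            if y[t] * np.dot(theta, X[t]) <= 0:    # mistake on (x_t, y_t)
                theta = theta + y[t] * X[t]        # theta^(k+1) = theta^(k) + y_t x_t
                k += 1
                made_mistake = True
        if not made_mistake:        # a full pass with no mistakes: converged
            break
    return theta, k

On a linearly separable training set the loop terminates after finitely many updates, and the returned k can be compared against the bound R²‖θ*‖²/γ² in (10) for any θ* that classifies all examples with margin γ.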
Margin and geometry

It is worthwhile to understand this result a bit further. For example, does ‖θ*‖²/γ² relate to how difficult the classification problem is? Indeed, it does. We claim that its inverse, i.e., γ/‖θ*‖, is the smallest distance in the image space from any example (image) to the decision boundary specified by θ*. In other words, it serves as a measure of how well the two classes of images are separated (by a linear boundary). We will call this the geometric margin or γ_geom (see Figure 1). 1/γ_geom is then a fair measure of how difficult the problem is: the smaller the geometric margin that separates the training images, the more difficult the problem.

Figure 1: Geometric margin (the decision boundary (θ*)ᵀx = 0, its normal θ*, and the distance γ_geom from the closest images to the boundary).

To calculate γ_geom we measure the distance from the decision boundary (θ*)ᵀx = 0 to one of the images x_t for which y_t (θ*)ᵀ x_t = γ. Since θ* specifies the normal to the decision boundary, the shortest path from the boundary to the image x_t will be parallel to the normal. The image for which y_t (θ*)ᵀ x_t = γ is therefore among those closest to the boundary. Now, let's define a line segment from x(0) = x_t, parallel to θ*, towards the boundary. This is given by

x(ξ) = x(0) − ξ y_t θ* / ‖θ*‖    (11)

where ξ defines the length of the line segment, since it multiplies a unit length vector. It remains to find the value of ξ such that (θ*)ᵀ x(ξ) = 0, or, equivalently, y_t (θ*)ᵀ x(ξ) = 0. This is the point where the segment hits the decision boundary. Thus

y_t (θ*)ᵀ x(ξ) = y_t (θ*)ᵀ ( x(0) − ξ y_t θ* / ‖θ*‖ )    (12)
              = y_t (θ*)ᵀ ( x_t − ξ y_t θ* / ‖θ*‖ )    (13)
              = y_t (θ*)ᵀ x_t − ξ ‖θ*‖² / ‖θ*‖    (14)
              = γ − ξ ‖θ*‖ = 0    (15)

where step (14) uses y_t² = 1, implying that the distance is exactly ξ = γ/‖θ*‖, as claimed. As a result, the bound on the number of perceptron updates can be written more succinctly in terms of the geometric margin γ_geom (distance to the boundary):

k ≤ ( R / γ_geom )²    (16)

with the understanding that γ_geom is the largest geometric margin that could be achieved by a linear classifier for this problem. Note that the result does not depend (directly) on the dimension d of the examples, nor on the number of training examples n. It is nevertheless tempting to interpret (R/γ_geom)² as a measure of difficulty (or ...

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.
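To see the geometric margin and the bound in (16) numerically, here is a small companion sketch (again my own illustration, not part of the notes). For a given separator θ* it computes the signed distances y_t (θ*)ᵀ x_t / ‖θ*‖ of the examples to the boundary, takes their minimum as the geometric margin of θ*, and evaluates (R/γ_geom)². For any θ* that classifies all examples with a positive margin this is a valid upper bound on the number of updates; the tightest version of (16) uses the maximum-margin separator.

import numpy as np

def geometric_margin(theta_star, X, y):
    """Smallest signed distance from the examples to the boundary (theta_star)^T x = 0."""
    # y_t * theta_star^T x_t / ||theta_star|| is the signed distance of x_t to the boundary
    distances = y * (X @ theta_star) / np.linalg.norm(theta_star)
    return distances.min()

def perceptron_update_bound(theta_star, X, y):
    """Evaluate the bound (R / gamma_geom)^2 of equation (16) for a given separator."""
    R = np.linalg.norm(X, axis=1).max()    # bound on the Euclidean norms of the examples
    gamma = geometric_margin(theta_star, X, y)
    assert gamma > 0, "theta_star must classify every example correctly with positive margin"
    return (R / gamma) ** 2

# Toy usage on a small separable set (made-up data, for illustration only):
# X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
# y = np.array([1, 1, -1, -1])
# theta_star = np.array([1.0, 1.0])
# print(geometric_margin(theta_star, X, y))         # geometric margin of this separator
# print(perceptron_update_bound(theta_star, X, y))  # corresponding bound on updates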

