6.867 Machine learning, lecture 2 (Jaakkola)

Perceptron, convergence, and generalization

Recall that we are dealing with linear classifiers through the origin, i.e.,

f(x; θ) = sign( θᵀx )    (1)

where θ ∈ R^d specifies the parameters that we have to estimate on the basis of training examples (images) x_1, ..., x_n and labels y_1, ..., y_n ∈ {−1, 1}.

We will use the perceptron algorithm to solve the estimation task. Let k denote the number of parameter updates we have performed and θ(k) the parameter vector after k updates. Initially k = 0 and θ(0) = 0. The algorithm then cycles through all the training instances (x_t, y_t) and updates the parameters only in response to mistakes, i.e., when the label is predicted incorrectly. More precisely, we set θ(k+1) = θ(k) + y_t x_t when y_t (θ(k))ᵀ x_t < 0 (mistake), and otherwise leave the parameters unchanged.

Convergence in a finite number of updates

Let's now show that the perceptron algorithm indeed converges in a finite number of updates. The same analysis will also help us understand how the linear classifier generalizes to unseen images. To this end, we will assume that all the (training) images have bounded Euclidean norms, i.e., ‖x_t‖ ≤ R for all t and some finite R. This is clearly the case for any pixel images with bounded intensity values. We also make a much stronger assumption: there exists a linear classifier in our class, with finite parameter values, that correctly classifies all the (training) images. More precisely, we assume that there is some θ* and some γ > 0 such that y_t (θ*)ᵀ x_t ≥ γ for all t = 1, ..., n. The additional number γ > 0 is used to ensure that each example is classified correctly with a finite margin.

The convergence proof is based on combining two results: 1) we will show that the inner product (θ*)ᵀ θ(k) increases at least linearly with each update, and 2) the squared norm ‖θ(k)‖² increases at most linearly in the number of updates k. By combining the two we can show that the cosine of the angle between θ(k) and θ* has to increase by a finite increment due to each update. Since cosine is bounded by one, it follows that we can only make a finite number of updates.

Part 1: we simply take the inner product (θ*)ᵀ θ(k) before and after each update. When making the kth update, say due to a mistake on image x_t, we get

(θ*)ᵀ θ(k) = (θ*)ᵀ θ(k−1) + y_t (θ*)ᵀ x_t ≥ (θ*)ᵀ θ(k−1) + γ    (2)

since, by assumption, y_t (θ*)ᵀ x_t ≥ γ for all t (θ* is always correct). Thus, after k updates,

(θ*)ᵀ θ(k) ≥ kγ    (3)

Part 2: our second claim follows simply from the fact that updates are made only on mistakes:

‖θ(k)‖² = ‖θ(k−1) + y_t x_t‖²    (4)
        = ‖θ(k−1)‖² + 2 y_t (θ(k−1))ᵀ x_t + ‖x_t‖²    (5)
        ≤ ‖θ(k−1)‖² + ‖x_t‖²    (6)
        ≤ ‖θ(k−1)‖² + R²    (7)

since y_t (θ(k−1))ᵀ x_t < 0 whenever an update is made and, by assumption, ‖x_t‖ ≤ R. Thus,

‖θ(k)‖² ≤ kR²    (8)

We can now combine parts 1) and 2) to bound the cosine of the angle between θ* and θ(k):

cos(θ*, θ(k)) = (θ*)ᵀ θ(k) / ( ‖θ(k)‖ ‖θ*‖ ) ≥ kγ / ( ‖θ(k)‖ ‖θ*‖ ) ≥ kγ / ( √(kR²) ‖θ*‖ )    (9)

where the first inequality follows from part 1) and the second from part 2). Since cosine is bounded by one, we get

1 ≥ kγ / ( √(kR²) ‖θ*‖ ),  or equivalently  k ≤ R² ‖θ*‖² / γ²    (10)
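To make the mistake-driven update rule concrete, here is a minimal sketch of the perceptron algorithm in Python/NumPy. The function name and the data layout are my own choices, not part of the lecture notes, and the sketch treats y_t (θ(k))ᵀ x_t ≤ 0 as a mistake so that the all-zero initial parameter vector also triggers updates, a common convention that differs slightly from the strict inequality in the text.

import numpy as np

def perceptron_through_origin(X, y, max_passes=1000):
    """Perceptron for linear classifiers through the origin.

    X is an (n, d) array of examples, y an (n,) array of +/-1 labels.
    Returns the parameter vector theta and the number of updates k.
    """
    n, d = X.shape
    theta = np.zeros(d)             # theta^(0) = 0
    k = 0                           # number of parameter updates so far
    for _ in range(max_passes):     # cycle through the training set
        made_mistake = False
        for t in range(n):
            if y[t] * np.dot(theta, X[t]) <= 0:    # mistake on (x_t, y_t)
                theta = theta + y[t] * X[t]        # theta^(k+1) = theta^(k) + y_t x_t
                k += 1
                made_mistake = True
        if not made_mistake:        # a full pass with no mistakes: converged
            break
    return theta, k

On a linearly separable training set the loop terminates after finitely many updates, and the returned k can be compared against the bound R²‖θ*‖²/γ² in (10) for any θ* that classifies all examples with margin γ.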
Margin and geometry

It is worthwhile to understand this result a bit further. For example, does ‖θ*‖²/γ² relate to how difficult the classification problem is? Indeed, it does. We claim that its inverse, i.e., γ/‖θ*‖, is the smallest distance in the image space from any example (image) to the decision boundary specified by θ*. In other words, it serves as a measure of how well the two classes of images are separated (by a linear boundary). We will call this the geometric margin or γ_geom (see Figure 1). 1/γ_geom is then a fair measure of how difficult the problem is: the smaller the geometric margin that separates the training images, the more difficult the problem.

Figure 1: Geometric margin (the decision boundary (θ*)ᵀx = 0, its normal θ*, and the distance γ_geom from the closest images to the boundary).

To calculate γ_geom we measure the distance from the decision boundary (θ*)ᵀx = 0 to one of the images x_t for which y_t (θ*)ᵀ x_t = γ. Since θ* specifies the normal to the decision boundary, the shortest path from the boundary to the image x_t will be parallel to the normal. The image for which y_t (θ*)ᵀ x_t = γ is therefore among those closest to the boundary. Now, let's define a line segment from x(0) = x_t, parallel to θ*, towards the boundary. This is given by

x(ξ) = x(0) − ξ y_t θ* / ‖θ*‖    (11)

where ξ defines the length of the line segment, since it multiplies a unit length vector. It remains to find the value of ξ such that (θ*)ᵀ x(ξ) = 0, or, equivalently, y_t (θ*)ᵀ x(ξ) = 0. This is the point where the segment hits the decision boundary. Thus

y_t (θ*)ᵀ x(ξ) = y_t (θ*)ᵀ ( x(0) − ξ y_t θ* / ‖θ*‖ )    (12)
              = y_t (θ*)ᵀ ( x_t − ξ y_t θ* / ‖θ*‖ )    (13)
              = y_t (θ*)ᵀ x_t − ξ ‖θ*‖² / ‖θ*‖    (14)
              = γ − ξ ‖θ*‖ = 0    (15)

where step (14) uses y_t² = 1, implying that the distance is exactly ξ = γ/‖θ*‖, as claimed. As a result, the bound on the number of perceptron updates can be written more succinctly in terms of the geometric margin γ_geom (distance to the boundary):

k ≤ ( R / γ_geom )²    (16)

with the understanding that γ_geom is the largest geometric margin that could be achieved by a linear classifier for this problem. Note that the result does not depend (directly) on the dimension d of the examples, nor on the number of training examples n. It is nevertheless tempting to interpret (R/γ_geom)² as a measure of difficulty (or ...

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.
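To see the geometric margin and the bound in (16) numerically, here is a small companion sketch (again my own illustration, not part of the notes). For a given separator θ* it computes the signed distances y_t (θ*)ᵀ x_t / ‖θ*‖ of the examples to the boundary, takes their minimum as the geometric margin of θ*, and evaluates (R/γ_geom)². For any θ* that classifies all examples with a positive margin this is a valid upper bound on the number of updates; the tightest version of (16) uses the maximum-margin separator.

import numpy as np

def geometric_margin(theta_star, X, y):
    """Smallest signed distance from the examples to the boundary (theta_star)^T x = 0."""
    # y_t * theta_star^T x_t / ||theta_star|| is the signed distance of x_t to the boundary
    distances = y * (X @ theta_star) / np.linalg.norm(theta_star)
    return distances.min()

def perceptron_update_bound(theta_star, X, y):
    """Evaluate the bound (R / gamma_geom)^2 of equation (16) for a given separator."""
    R = np.linalg.norm(X, axis=1).max()    # bound on the Euclidean norms of the examples
    gamma = geometric_margin(theta_star, X, y)
    assert gamma > 0, "theta_star must classify every example correctly with positive margin"
    return (R / gamma) ** 2

# Toy usage on a small separable set (made-up data, for illustration only):
# X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
# y = np.array([1, 1, -1, -1])
# theta_star = np.array([1.0, 1.0])
# print(geometric_margin(theta_star, X, y))         # geometric margin of this separator
# print(perceptron_update_bound(theta_star, X, y))  # corresponding bound on updates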

