6.867 Machine learning, lecture 8 (Jaakkola)

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Lecture topics:
• Support vector machine and kernels
• Kernel optimization, selection

Support vector machine revisited

Our task here is to first turn the support vector machine into its dual form, where the examples only appear in inner products. To this end, assume we have mapped the examples into feature vectors φ(x) of dimension d and that the resulting training set (φ(x_1), y_1), ..., (φ(x_n), y_n) is linearly separable. Finding the maximum margin linear separator in the feature space now corresponds to solving

\[
\text{minimize } \|\theta\|^2/2 \quad \text{subject to } y_t(\theta^T\phi(x_t) + \theta_0) \ge 1, \quad t = 1, \ldots, n \tag{1}
\]

We will discuss later on how slack variables affect the resulting kernel (dual) form. They merely complicate the derivation without changing the procedure. Optimization problems of the above type (convex, linear constraints) can be turned into their dual form by means of Lagrange multipliers. Specifically, we introduce a non-negative scalar parameter α_t for each inequality constraint and cast the estimation problem in terms of θ, θ_0, and α = {α_1, ..., α_n}:

\[
J(\theta, \theta_0; \alpha) = \|\theta\|^2/2 - \sum_{t=1}^{n} \alpha_t \left[\, y_t(\theta^T\phi(x_t) + \theta_0) - 1 \,\right] \tag{2}
\]

The original minimization problem for θ and θ_0 is recovered by maximizing J(θ, θ_0; α) with respect to α. In other words,

\[
J(\theta, \theta_0) = \max_{\alpha \ge 0} J(\theta, \theta_0; \alpha) \tag{3}
\]

where α ≥ 0 means that all the components α_t are non-negative. Let's try to see first that J(θ, θ_0) really is equivalent to the original problem. Suppose we set θ and θ_0 such that at least one of the constraints, say the one corresponding to (x_i, y_i), is violated. In that case

\[
-\alpha_i \left[\, y_i(\theta^T\phi(x_i) + \theta_0) - 1 \,\right] > 0 \tag{4}
\]

for any α_i > 0. We can then set α_i = ∞ to obtain J(θ, θ_0) = ∞. You can think of the Lagrange multipliers as playing an adversarial role to enforce the margin constraints. More formally,

\[
J(\theta, \theta_0) =
\begin{cases}
\|\theta\|^2/2, & \text{if } y_t(\theta^T\phi(x_t) + \theta_0) \ge 1,\ t = 1, \ldots, n \\
\infty, & \text{otherwise}
\end{cases} \tag{5}
\]

The minimizing θ and θ_0 are therefore those that satisfy the constraints. On the basis of a general set of criteria governing optimality when dealing with Lagrange multipliers, criteria known as the Slater conditions, we can actually switch the maximization over α and the minimization over {θ, θ_0} and get the same answer:

\[
\min_{\theta, \theta_0}\, \max_{\alpha \ge 0}\, J(\theta, \theta_0; \alpha) = \max_{\alpha \ge 0}\, \min_{\theta, \theta_0}\, J(\theta, \theta_0; \alpha) \tag{6}
\]

The left hand side, equivalent to minimizing Eq. (5), is known as the primal form, while the right hand side is the dual form. Let's solve the right hand side by first obtaining θ and θ_0 as functions of the Lagrange multipliers (and the data). To this end,

\[
\frac{d}{d\theta_0} J(\theta, \theta_0; \alpha) = -\sum_{t=1}^{n} \alpha_t y_t = 0 \tag{7}
\]
\[
\frac{d}{d\theta} J(\theta, \theta_0; \alpha) = \theta - \sum_{t=1}^{n} \alpha_t y_t \phi(x_t) = 0 \tag{8}
\]

So, again, the solution for θ is in the span of the feature vectors corresponding to the training examples. Substituting this form of the solution for θ back into the objective, and taking into account the constraint corresponding to the optimal θ_0, we get

\[
J(\alpha) = \min_{\theta, \theta_0} J(\theta, \theta_0; \alpha) \tag{9}
\]
\[
= \begin{cases}
\displaystyle \sum_{t=1}^{n} \alpha_t - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left[\phi(x_i)^T\phi(x_j)\right], & \text{if } \sum_{t=1}^{n} \alpha_t y_t = 0 \\
-\infty, & \text{otherwise}
\end{cases} \tag{10}
\]

The dual form of the solution is therefore obtained by maximizing

\[
\sum_{t=1}^{n} \alpha_t - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left[\phi(x_i)^T\phi(x_j)\right] \tag{11}
\]
\[
\text{subject to } \alpha_t \ge 0, \quad \sum_{t=1}^{n} \alpha_t y_t = 0 \tag{12}
\]

This is the dual or kernel form of the support vector machine, and it is also a quadratic optimization problem.
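To make Eqs. (11)-(12) concrete, here is a minimal numerical sketch in Python (an addition to these notes, not part of them). It builds the Gram matrix for a linear kernel on a small made-up dataset and hands the dual quadratic program to scipy's general-purpose SLSQP routine; the lecture does not prescribe any particular solver, and in practice a dedicated QP solver would be used. The dataset, the tolerance, and the variable names are illustrative assumptions.

```python
# Minimal sketch of the dual SVM problem, Eqs. (11)-(12), under the
# assumptions stated above (toy data, linear kernel, SLSQP solver).
import numpy as np
from scipy.optimize import minimize

def linear_kernel(X):
    # Gram matrix K[i, j] = phi(x_i)^T phi(x_j) with phi(x) = x
    return X @ X.T

# toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = linear_kernel(X)
Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K(x_i, x_j)

# maximize sum(alpha) - 0.5 alpha^T Q alpha
# <=> minimize 0.5 alpha^T Q alpha - sum(alpha)
objective = lambda a: 0.5 * a @ Q @ a - a.sum()
gradient  = lambda a: Q @ a - np.ones(n)

res = minimize(objective, np.zeros(n), jac=gradient, method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_t >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_t alpha_t y_t = 0
alpha = res.x

# recover theta from Eq. (8) and theta_0 from the support vectors (Eq. (17))
sv = alpha > 1e-6
theta = (alpha[sv] * y[sv]) @ X[sv]
theta0 = np.median(y[sv] - X[sv] @ theta)
print(alpha, theta, theta0)
```

Note how the equality constraint from Eq. (7) and the bounds α_t ≥ 0 enter the solver directly, while the data appear only through the Gram matrix K.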
The constraints are simpler, however. Moreover, the dimension of the input vectors does not appear explicitly as part of the optimization problem. It is formulated solely on the basis of the Gram matrix

\[
K = \begin{bmatrix}
\phi(x_1)^T\phi(x_1) & \cdots & \phi(x_1)^T\phi(x_n) \\
\vdots & \ddots & \vdots \\
\phi(x_n)^T\phi(x_1) & \cdots & \phi(x_n)^T\phi(x_n)
\end{bmatrix} \tag{13}
\]

We have already seen that the maximum margin hyperplane can be constructed on the basis of only a subset of the training examples. This should also hold in terms of the feature vectors. How will this be manifested in the α̂_t's? Many of them will be exactly zero as a result of the optimization. In fact, they are non-zero only for examples (feature vectors) that are support vectors. Once we have solved for the α̂_t's, we can classify any new example according to the discriminant function

\[
\hat{y}(x) = \hat{\theta}^T\phi(x) + \hat{\theta}_0 \tag{14}
\]
\[
= \sum_{t=1}^{n} \hat{\alpha}_t y_t \left[\phi(x_t)^T\phi(x)\right] + \hat{\theta}_0 \tag{15}
\]
\[
= \sum_{t \in SV} \hat{\alpha}_t y_t \left[\phi(x_t)^T\phi(x)\right] + \hat{\theta}_0 \tag{16}
\]

where SV is the set of support vectors, corresponding to non-zero values of α̂_t. We don't know which examples (feature vectors) become support vectors until we have solved the optimization problem. Moreover, the identity of the support vectors will depend on the feature mapping or the kernel function.

But what is θ̂_0? It appeared to drop out of the optimization problem. We can set θ̂_0 after solving for the α̂_t's by looking at the support vectors. Indeed, for all i ∈ SV we should have

\[
y_i(\hat{\theta}^T\phi(x_i) + \hat{\theta}_0) = y_i \sum_{t \in SV} \hat{\alpha}_t y_t \left[\phi(x_t)^T\phi(x_i)\right] + y_i \hat{\theta}_0 = 1 \tag{17}
\]

from which we can easily solve for θ̂_0. In principle, selecting any single support vector would suffice, but since we typically solve the quadratic program over the α_t's only up to some numerical resolution, these constraints may not be satisfied with equality. It is therefore advisable to construct θ̂_0 as the median of the values implied by the support vectors.

What is the geometric margin we attain with some kernel function K(x, x') = φ(x)^T φ(x')? It is still 1/‖θ̂‖. In kernel form,

\[
\hat{\gamma}_{\text{geom}} = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{\alpha}_i \hat{\alpha}_j y_i y_j K(x_i, x_j) \right)^{-1/2} \tag{18}
\]
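To connect Eqs. (14)-(18) to code, the sketch below (again Python, an assumption of this note rather than part of the lecture) packages the kernel-form classifier. It expects the α̂_t's from a dual solver such as the one sketched earlier, the training inputs and labels, and a kernel supplied as a plain function k(x, z) = φ(x)^T φ(z); the function names and the support-vector tolerance are illustrative choices. The offset θ̂_0 is taken as the median over the support vectors, as the text recommends, and the margin follows Eq. (18).

```python
# Kernel-form classifier built from a solved dual (Eqs. (14)-(18)),
# under the assumptions stated above.
import numpy as np

def support_vectors(alpha, tol=1e-6):
    # Indices with non-zero (up to numerical tolerance) multipliers.
    return np.flatnonzero(alpha > tol)

def offset(alpha, y, X, k, tol=1e-6):
    # theta_0 from Eq. (17): for each support vector i,
    # theta_0 = y_i - sum_{t in SV} alpha_t y_t k(x_t, x_i)
    # (uses 1/y_i = y_i for labels in {-1, +1}); take the median.
    sv = support_vectors(alpha, tol)
    K_sv = np.array([[k(X[t], X[i]) for i in sv] for t in sv])
    return np.median(y[sv] - (alpha[sv] * y[sv]) @ K_sv)

def discriminant(x, alpha, y, X, k, theta0, tol=1e-6):
    # Eq. (16): sum over support vectors of alpha_t y_t k(x_t, x), plus theta_0.
    sv = support_vectors(alpha, tol)
    return sum(alpha[t] * y[t] * k(X[t], x) for t in sv) + theta0

def geometric_margin(alpha, y, X, k):
    # Eq. (18): (sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j))^(-1/2).
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    return float((alpha * y) @ K @ (alpha * y)) ** -0.5
```

With the linear kernel k = lambda a, b: a @ b and the α̂ from the earlier sketch, discriminant(x, ...) reproduces θ̂^T x + θ̂_0 without ever forming θ̂ explicitly, which is the point of the kernel form.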

