SVM as a Convex Optimization Problem
Leon Gu
CSD, CMU

Convex Optimization

- Convex set: the line segment between any two points in the set lies in the set.
- Convex function: the line segment between any two points (x, f(x)) and (y, f(y)) lies on or above the graph of f.
- Convex optimization problem:

      minimize   f_0(x)                              (1)
      s.t.       f_i(x) ≤ 0,   i = 1, ..., m         (2)
                 h_i(x) = 0,   i = 1, ..., p         (3)

  where f_0 and the f_i are convex and the h_i are linear (affine).
  - Convex objective function over a convex domain (feasible set).
  - Any local optimum is also a global optimum.
- Operations that preserve convexity:
  - (for convex sets) intersection, affine transformation, perspective transformation, ...
  - (for convex functions) nonnegative weighted sum, maximum and supremum, composition with affine functions, composition with monotonic convex/concave functions, ...

Optimal Separating Hyperplane

Suppose that our data set {x_i, y_i}, i = 1, ..., N, is linearly separable. Define a hyperplane by

      {x : f(x) = β^T x + β_0 = β^T (x − x_0) = 0},   where ‖β‖ = 1.

- f(x) is the signed distance to the hyperplane.
- We can define a classification rule induced by f(x): sgn[β^T (x − x_0)].

Define the margin of f(x) to be the minimal value of y f(x) over the data,

      C = min_i  y_i f(x_i).

An optimal separating hyperplane is the hyperplane that maximizes the margin,

      max_{β, β_0, ‖β‖=1}  C,   s.t.  y_i (β^T x_i + β_0) ≥ C,   i = 1, ..., N.

We can get rid of the norm constraint on β by rewriting the constraint as

      (1/‖β‖) y_i (β^T x_i + β_0) ≥ C,

and arbitrarily setting ‖β‖ = 1/C; then we can rephrase the problem as

      min_{β, β_0}  ‖β‖,   s.t.  y_i (β^T x_i + β_0) ≥ 1,   i = 1, ..., N.

This is a convex optimization problem.

Soft-Margin SVM

The data is not always perfect. We need to extend the optimal separating hyperplane to non-separable cases. The trick is to relax the margin constraints by introducing some "slack" variables:

      minimize   ‖β‖  over  β, β_0                               (4)
      s.t.       y_i (β^T x_i + β_0) ≥ 1 − ξ_i,   i = 1, ..., N  (5)
                 ξ_i ≥ 0;   Σ_{i=1}^N ξ_i ≤ Z                    (6)

- Still convex.
- ξ_i > 1: the point is misclassified; 0 < ξ_i ≤ 1: the point is correctly classified but lies inside the margin.
- Z is a tuning parameter.

How to solve it?
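The lecture solves this problem through its dual, but the constrained primal (4)–(6) is also equivalent, for a suitable trade-off constant, to an unconstrained hinge-loss objective that can be minimized directly. The sketch below is an illustration, not part of the lecture: a plain subgradient-descent solver in NumPy, where the hinge terms max(0, 1 − y_i f(x_i)) play the role of the slack variables ξ_i; the toy data, the constant C_hinge, and the step-size schedule are all assumptions.

```python
import numpy as np

def train_soft_margin_svm(X, y, C_hinge=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on the hinge-loss form of the soft-margin SVM:
        min_{beta, beta0}  0.5*||beta||^2 + C_hinge * sum_i max(0, 1 - y_i*(beta^T x_i + beta0))
    Each hinge term equals the optimal slack xi_i for the current (beta, beta0)."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ beta + beta0)
        viol = margins < 1                      # points with xi_i > 0 (inside margin or misclassified)
        # Subgradient of the objective w.r.t. beta and beta0.
        g_beta = beta - C_hinge * (y[viol, None] * X[viol]).sum(axis=0)
        g_beta0 = -C_hinge * y[viol].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0

# Made-up toy data: two well-separated 2-D Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [+1.0] * 20)
beta, beta0 = train_soft_margin_svm(X, y)
print("training accuracy:", (np.sign(X @ beta + beta0) == y).mean())
```

On separable data such as this, the hinge violations disappear as training proceeds and the solver settles on a separating hyperplane.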
Use Lagrange/duality theory.

Lagrangian Theory

Lagrangian theory characterizes the solution of a constrained optimization problem. Recall the primal problem:

      minimize   f_0(x)                              (7)
      s.t.       f_i(x) ≤ 0,   i = 1, ..., m         (8)
                 h_i(x) = 0,   i = 1, ..., p         (9)

The stationary points are given by

      df_0(x)/dx + Σ_{i=1}^m λ_i df_i(x)/dx + Σ_{i=1}^p ν_i dh_i(x)/dx = 0,

where λ, ν are free parameters called Lagrange multipliers. Accordingly, we define the Lagrangian primal function (or Lagrangian) as

      L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x),

and the Lagrangian dual function g(λ, ν) as

      g(λ, ν) = inf_{x ∈ X} L(x, λ, ν).

The so-called Lagrangian dual problem is the following:

      maximize   g(λ, ν)     (10)
      s.t.       λ ≥ 0.      (11)

The weak duality theorem says

      g(λ, ν) ≤ f_0(x*)   for all λ ≥ 0 and all ν.

In other words, maximizing g(λ, ν) over λ and ν produces a lower bound on f_0(x*). (Note that g(λ, ν) is concave, being a pointwise infimum of functions affine in (λ, ν).) The difference between f_0(x*) and g(λ*, ν*) is called the "duality gap".

K.K.T. Conditions

Slater's theorem (strong duality) says: if the problem is convex and the constraint functions are affine (more generally, if a strictly feasible point exists), the duality gap is zero.

Then, the K.K.T. conditions provide necessary and sufficient conditions for a point x* to be an optimum:

      ∂L(x, λ*, ν*)/∂x at x = x*  equals 0      first-order condition of optimality
      λ*_i f_i(x*) = 0                          complementary slackness conditions
      λ*_i ≥ 0                                  dual constraints
      f_i(x*) ≤ 0                               primal constraints
      h_i(x*) = 0                               primal constraints

Remark: the complementary slackness conditions are directly related to support vectors.

The Dual Problem

Recall the primal problem (soft-margin SVM):

      minimize   ‖β‖  over  β, β_0                               (12)
      s.t.       y_i (β^T x_i + β_0) ≥ 1 − ξ_i,   i = 1, ..., N  (13)
                 ξ_i ≥ 0;   Σ_{i=1}^N ξ_i ≤ Z                    (14)

Strong duality clearly holds (the constraints are affine), so we can find the dual problem by the following steps:

1. Define the Lagrangian primal function (and Lagrange multipliers).
2. Take the first-order derivatives w.r.t. β, β_0 and ξ_i, and set them to zero.
3. Substitute the results back into the primal function.

This yields the dual problem:

      maximize   L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{i'=1}^N α_i α_{i'} y_i y_{i'} ⟨x_i, x_{i'}⟩   (15)
      s.t.       0 ≤ α_i ≤ γ,   i = 1, ..., N    (16)
                 Σ_{i=1}^N α_i y_i = 0           (17)

Solution:

      β̂ = Σ_{i=1}^N α̂_i y_i x_i                                     (19)
      f(x) = β̂^T x + β̂_0 = Σ_{i=1}^N α̂_i y_i ⟨x_i, x⟩ + β̂_0        (20)

- Sparse representation: the separating hyperplane f(x) is spanned by those data points x_i for which α_i ≠ 0, called support vectors.
- This follows directly from the complementary slackness conditions:

      α̂_i [ y_i (β̂^T x_i + β̂_0) + ξ_i − 1 ] = 0.

- Both the estimation and the evaluation of f(x) involve the data only through dot products.
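As an illustration of the dual formulation (not from the lecture), the sketch below solves (15)–(17) numerically for toy 2-D data using SciPy's SLSQP solver, recovers β̂ via (19), and checks that only a few α_i are nonzero. The data, the value of γ, and the numerical tolerances are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up toy data: two well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (15, 2)), rng.normal(+2, 0.5, (15, 2))])
y = np.array([-1.0] * 15 + [+1.0] * 15)
N, gamma = len(y), 10.0                      # gamma: the box constraint in (16)

# Q_ij = y_i y_j <x_i, x_j>, so L_D(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):                             # minimize -L_D(alpha)
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(N),
               jac=lambda a: Q @ a - np.ones(N),
               bounds=[(0.0, gamma)] * N,                       # (16)
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # (17)
               method="SLSQP")
alpha = res.x

# Recover beta from (19); beta0 from margin support vectors (0 < alpha_i < gamma),
# where complementary slackness forces y_i (beta^T x_i + beta0) = 1.
beta = (alpha * y) @ X
sv = (alpha > 1e-5) & (alpha < gamma - 1e-5)
beta0 = np.mean(y[sv] - X[sv] @ beta)

print("support vectors:", int(np.sum(alpha > 1e-5)), "of", N)
print("accuracy:", (np.sign(X @ beta + beta0) == y).mean())
```

Note how the sparsity claim shows up directly: most α_i land at the lower bound 0, and the hyperplane is a combination of only the few support vectors.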