CS269: Machine Learning Theory
Lecture 16: SVMs and Kernels
November 17, 2010
Lecturer: Jennifer Wortman Vaughan
Scribe: Jason Au, Ling Fang, Kwanho Lee

Today, we will continue on the topic of support vector machines, finishing our derivation of the dual optimization problem. Next, we will go over how to solve the dual optimization problem and examine the case in which the data is not linearly separable. For dealing with such a case, we introduce two approaches: the Soft Margin Approach and the Kernel Trick.

1 The Primal and Dual Problems

When we talked about the analysis of boosting, we found that having a large margin led to small generalization error. In the last lecture, we showed that we could maximize the margin directly by framing the maximization as a convex optimization problem. In Section 1.1, we will derive the dual of this problem, which is more convenient to compute and can be used with kernels.

In the previous lecture, we introduced the standard form of the optimization problem:

\[
\min_{\vec{w},\,b} \; \frac{1}{2}\|\vec{w}\|^2
\quad \text{s.t.} \quad 1 - y_i(\vec{w}\cdot\vec{x}_i + b) \le 0, \quad i = 1,\dots,m
\]

We also showed in the previous lecture that solving this convex optimization problem is equivalent to solving the following optimization of the Lagrangian function:

\[
\min_{\vec{w},\,b}\ \max_{\vec{\alpha}:\,\alpha_i \ge 0} L(\vec{w}, b, \vec{\alpha}),
\]

where $L$ is the Lagrangian function. We dropped the $\vec{\beta}$ that appeared in the Lagrangian in the previous lecture because $\vec{\beta}$ is the vector of multipliers for the equality constraints; here we are dealing only with inequality constraints, which correspond to $\vec{\alpha}$.

We previously showed that the solution to the primal problem is equivalent to the solution to the dual problem if the following primal-dual equivalence conditions are satisfied. First, we need a convex objective function; in our case, it is $\frac{1}{2}\|\vec{w}\|^2$. Second, we need convex inequality constraints $g_i$, which here are $1 - y_i(\vec{w}\cdot\vec{x}_i + b)$ for $i = 1,\dots,m$. The last condition states that for each inequality $g_i$, there exists some value of the variables that makes $g_i$ strictly less than 0. Since these conditions are satisfied for the optimization problem we defined, we know that we can solve the dual version of the problem instead of the primal problem and the solutions will be equivalent.

1.1 Deriving The Dual Problem

The dual problem is defined as follows:

\[
\max_{\vec{\alpha}:\,\alpha_i \ge 0}\ \min_{\vec{w},\,b} L(\vec{w}, b, \vec{\alpha})
\]

We will derive a more convenient form in three steps. The first step is to calculate the value of the Lagrangian function by plugging in our objective function and constraints:

\[
L(\vec{w}, b, \vec{\alpha})
= \underbrace{f(\vec{w}, b)}_{\text{objective function}}
+ \underbrace{\sum_{i=1}^{m} \alpha_i\, g_i(\vec{w}, b)}_{\text{inequality constraints}}
= \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{m} \alpha_i \bigl(1 - y_i(\vec{w}\cdot\vec{x}_i + b)\bigr)
\]

The next step is to find the minimum value of the Lagrangian over $\vec{w}$ and $b$. Because the Lagrangian is convex in these variables, the minimum occurs where the partial derivatives are equal to 0. Thus, we can derive the following:

\[
\nabla_{\vec{w}} L = 0 \;\;\Rightarrow\;\; \vec{w} - \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i = 0 \;\;\Rightarrow\;\; \vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i
\]

\[
\frac{\partial L}{\partial b} = 0 \;\;\Rightarrow\;\; \sum_{i=1}^{m} \alpha_i y_i = 0
\]

By plugging these values into the Lagrangian we derived in the first step, we can verify that

\[
\min_{\vec{w},\,b} L(\vec{w}, b, \vec{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, \vec{x}_i \cdot \vec{x}_j
\]

By maximizing this expression with respect to $\vec{\alpha}$ and applying the constraint we derived from the partial derivative with respect to $b$, we obtain the dual form of the problem:

\[
\max_{\vec{\alpha}} \;\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, \vec{x}_i \cdot \vec{x}_j
\quad \text{s.t.} \quad \alpha_i \ge 0 \text{ for } i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0
\]

This version of the problem is a quadratic program and is easy to solve. Additionally, it depends on the $\vec{x}_i$ only through dot products of the form $\vec{x}_i \cdot \vec{x}_j$, so it can be used with the kernel trick, which we will introduce in the last section.
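As a concrete illustration (this sketch is not part of the original notes), the dual above can be solved numerically on a small example. The snippet below assumes a made-up, linearly separable 2D toy dataset and uses SciPy's general-purpose SLSQP solver as a stand-in for a dedicated QP solver; it then recovers $\vec{w}$ from the stationarity relation $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ derived above.

    import numpy as np
    from scipy.optimize import minimize

    # Toy, linearly separable data with labels in {-1, +1} (made up for illustration).
    X = np.array([[ 2.0,  2.0], [ 3.0,  3.0], [ 2.5,  3.5],
                  [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    m = len(y)

    # Gram matrix of dot products x_i . x_j -- the only way the data enters the dual.
    K = X @ X.T
    Q = (y[:, None] * y[None, :]) * K

    # SciPy minimizes, so we negate the dual objective:
    #   maximize  sum_i alpha_i - (1/2) sum_{i,j} y_i y_j alpha_i alpha_j (x_i . x_j)
    def neg_dual(alpha):
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
    bounds = [(0.0, None)] * m                              # alpha_i >= 0

    res = minimize(neg_dual, x0=np.zeros(m), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    alpha = res.x

    # Recover the weight vector from stationarity: w = sum_i alpha_i y_i x_i.
    w = (alpha * y) @ X
    print("alpha:", np.round(alpha, 4))
    print("w:    ", w)

Note that the data enters the computation only through the Gram matrix of dot products, which is exactly the property that the kernel trick will exploit.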
2 KKT Conditions

In the last class, we stated the KKT conditions. Recall that when the primal-dual equivalence conditions we mentioned earlier are satisfied, we are guaranteed that a set of values $\vec{w}^*, b^*, \vec{\alpha}^*$ is a solution to both the primal and dual problems if and only if the KKT conditions are met. For our problem, the KKT conditions imply the following restrictions on solutions:

1. Stationarity: From the partial derivatives taken above, we know this gives us

\[
\vec{w}^* = \sum_{i=1}^{m} \alpha_i^* y_i \vec{x}_i, \qquad \sum_{i=1}^{m} \alpha_i^* y_i = 0.
\]

Therefore, by plugging in the value of $\vec{w}^*$, the linear separator we output must be of the form

\[
h(\vec{x}) = \operatorname{sign}(\vec{w}^* \cdot \vec{x} + b^*) = \operatorname{sign}\!\left(\sum_{i=1}^{m} \alpha_i^* y_i\, \vec{x}_i \cdot \vec{x} + b^*\right).
\]

Again, evaluating $h(\vec{x})$ depends on $\vec{x}_i$ and $\vec{x}$ only through the dot product $\vec{x}_i \cdot \vec{x}$, and not in any other way. This will be important when we discuss kernels.

2. Primal Feasibility: The first primal feasibility condition we mentioned in the previous lecture is irrelevant here because there are no equality constraints in the original primal problem. The primal feasibility condition for the inequality constraints is

\[
y_i(\vec{x}_i \cdot \vec{w}^* + b^*) \ge 1, \quad i = 1,\dots,m.
\]

This ensures a margin of $1/\|\vec{w}^*\|$.

3. Dual Feasibility: This constraint is the same as in the dual problem:

\[
\alpha_i^* \ge 0, \quad i = 1,\dots,m.
\]

4. Complementary Slackness: The implications of this condition are rather interesting. It implies that

\[
\alpha_i^*\bigl(y_i(\vec{w}^* \cdot \vec{x}_i + b^*) - 1\bigr) = 0, \quad i = 1,\dots,m.
\]

This means that for any input point $\vec{x}_i$, at least one of the primal or dual constraints is tight, i.e., $\alpha_i^* = 0$ or $y_i(\vec{x}_i \cdot \vec{w}^* + b^*) = 1$. When $y_i(\vec{x}_i \cdot \vec{w}^* + b^*) > 1$, the point's distance from the separator is greater than the minimum margin, and this condition ensures that $\alpha_i^* = 0$, so the point $\vec{x}_i$ does not contribute to our weight vector. On the other hand, if $\alpha_i^*$ is nonzero, then $\vec{x}_i$ contributes to our weight vector, so its margin must be the minimum; we call these points the support vectors.

The number of support vectors is typically much smaller than the number of data points $m$.

[Figure 1: a linear separator with its support vectors. Only the support vectors contribute to our weight vector: if $\vec{x}_i$ is a support vector, we have $\alpha_i^*$ nonzero and $y_i(\vec{x}_i \cdot \vec{w}^* + b^*) = 1$.]

3 Solving The Dual Problem

There are many different algorithms that can be used to solve the dual problem. We can use any type of hill-climbing technique. One popular technique is the Sequential Minimal Optimization (SMO) algorithm, which uses simple coordinate ascent with a few constraints that need to be worked around.

Suppose we want to solve an unconstrained optimization problem of the form

\[
\max_{\vec{\alpha}} f(\vec{\alpha}), \quad \text{where } f \text{ is concave.}
\]

We could solve this problem by running a simple Coordinate Ascent algorithm. While there are many ways to specify the details of Coordinate Ascent, the outline of the algorithm is as follows (a concrete code sketch appears at the end of this section):

    initialize $\vec{\alpha}$
    repeat until convergence:
        choose some $i$
        set $\alpha_i \leftarrow \arg\max_{\hat{\alpha}} f(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}, \alpha_{i+1}, \dots, \alpha_m)$

Unfortunately, we can't apply Coordinate Ascent directly to our problem because of the stationarity condition we derived earlier: the constraint $\sum_{i=1}^{m} \alpha_i y_i = 0$ ties the coordinates together, so no single $\alpha_i$ can be changed while all the others are held fixed.
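To make the outline above concrete (this example is not from the notes), here is a minimal sketch of plain Coordinate Ascent on a made-up unconstrained, strictly concave quadratic $f(\vec{\alpha}) = \vec{b}\cdot\vec{\alpha} - \frac{1}{2}\vec{\alpha}^{\top} A \vec{\alpha}$ with $A$ positive definite, so that each inner argmax over a single coordinate has a closed form.

    import numpy as np

    # Made-up strictly concave objective f(alpha) = b . alpha - (1/2) alpha^T A alpha.
    rng = np.random.default_rng(0)
    m = 5
    M = rng.normal(size=(m, m))
    A = M @ M.T + m * np.eye(m)      # positive definite => f is strictly concave
    b = rng.normal(size=m)

    alpha = np.zeros(m)
    for sweep in range(1000):
        old = alpha.copy()
        for i in range(m):
            # argmax over alpha_i alone: set df/dalpha_i = b_i - (A alpha)_i = 0,
            # i.e. alpha_i = (b_i - sum_{j != i} A_ij alpha_j) / A_ii.
            residual = b[i] - A[i] @ alpha + A[i, i] * alpha[i]
            alpha[i] = residual / A[i, i]
        if np.linalg.norm(alpha - old) < 1e-12:
            break

    print("coordinate ascent:", np.round(alpha, 4))
    print("exact maximizer  :", np.round(np.linalg.solve(A, b), 4))

Each sweep increases $f$, and for this strictly concave quadratic the iterates converge to the unique maximizer $A^{-1}\vec{b}$; the key point is that every coordinate is free to move on its own, which is exactly what the equality constraint in the SVM dual rules out.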

