CS260: Machine Learning Theory
Lecture 15: Convex Optimization and Maximizing the Margin
November 14, 2011
Lecturer: Jennifer Wortman Vaughan

(All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.)

1 Road Map

During our discussion of AdaBoost, we saw that large margin classifiers lead to low generalization error. Given this observation, one might ask whether it is possible to design a learning algorithm that explicitly maximizes the margin of the training data. This is the idea behind Support Vector Machines (SVMs), which maximize the margin directly using techniques from convex optimization.

We begin this lecture with a review of the necessary ideas from optimization, and then go on to define the primal form of the SVM algorithm. In the next lecture, we will derive the dual form of the algorithm, which turns out to be useful. We will also see how to extend the algorithm to handle data that cannot be classified perfectly with a linear separator.

For a much more thorough introduction to convex optimization, check out Boyd and Vandenberghe's great book, which is available for free online,[1] or take Vandenberghe's class here at UCLA.

2 Basics of Convex Optimization

We say that an optimization problem is in standard form if it is possible to write it as
\[
\begin{aligned}
\min_{\vec{w}} \quad & f(\vec{w}) \\
\text{s.t.} \quad & g_i(\vec{w}) \le 0, \quad i = 1, \ldots, k \\
& h_i(\vec{w}) = 0, \quad i = 1, \ldots, \ell
\end{aligned}
\]
for some objective function $f$, inequality constraints $g_i(\vec{w}) \le 0$, $i = 1, \ldots, k$, and equality constraints $h_i(\vec{w}) = 0$, $i = 1, \ldots, \ell$.

We say that a standard form problem is convex if $f$ is convex, each of the functions $g_i$ is convex, and each of the functions $h_i$ is affine, i.e., can be written as $h_i(\vec{w}) = \vec{z} \cdot \vec{w} + b$ for some $\vec{z}$ and $b$.[2]

Convex standard form optimization problems are nice because they can be solved efficiently using a variety of different techniques, including the subgradient method, the interior point method, and the ellipsoid method (which we unfortunately will not have time to discuss in this class). We will refer to a problem of this form as our primal problem.

[1] http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
[2] This is different from the definition given in class, where we said that "standard form" implies that $f$ and $g_i$ are convex and $h_i$ affine, but as some people pointed out, it is more common in references to see "standard form" defined as it is here.

The Lagrangian of an optimization problem is defined as
\[
L(\vec{w}, \vec{\alpha}, \vec{\beta}) = f(\vec{w}) + \sum_{i=1}^{k} \alpha_i g_i(\vec{w}) + \sum_{i=1}^{\ell} \beta_i h_i(\vec{w}),
\]
and the components of $\vec{\alpha}$ and $\vec{\beta}$ are referred to as Lagrange multipliers. Let's think about the value
\[
\max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} L(\vec{w}, \vec{\alpha}, \vec{\beta}) = \max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} \left( f(\vec{w}) + \sum_{i=1}^{k} \alpha_i g_i(\vec{w}) + \sum_{i=1}^{\ell} \beta_i h_i(\vec{w}) \right).
\]
First consider the case in which the primal constraints are met. In this case, the Lagrangian is maximized when we set each $\alpha_i$ to 0. Each $\beta_i$ can be set arbitrarily since $h_i(\vec{w}) = 0$. In this case, the second and third terms of the Lagrangian are 0, and we're left with just $f(\vec{w})$. What if the primal constraints are not met? In this case, it is easy to verify that the maximal value is infinite.

We therefore have that
\[
\max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} L(\vec{w}, \vec{\alpha}, \vec{\beta}) =
\begin{cases}
f(\vec{w}) & \text{if the primal constraints are satisfied} \\
\infty & \text{otherwise.}
\end{cases}
\]
Because of this, the primal optimization problem is equivalent to the following:
\[
\min_{\vec{w}} \; \max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} L(\vec{w}, \vec{\alpha}, \vec{\beta}).
\]
We can define what is called the dual optimization problem by switching the min and max:
\[
\max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} \; \min_{\vec{w}} L(\vec{w}, \vec{\alpha}, \vec{\beta}).
\]
Let's see how these problems are related. It is easy to show that
\[
\max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} \; \min_{\vec{w}} L(\vec{w}, \vec{\alpha}, \vec{\beta}) \;\le\; \min_{\vec{w}} \; \max_{\vec{\alpha}, \vec{\beta} : \alpha_i \ge 0} L(\vec{w}, \vec{\alpha}, \vec{\beta}).
\]
This has nothing to do with the form of the Lagrangian; the "max min" is always less than or equal to the "min max," a fact that was used in the proof of the Minimax Theorem.

What is more surprising is that for standard form problems that are convex (i.e., have convex $f$ and $g_i$ and affine $h_i$), if it is the case that for all $i$ there exists some $\vec{w}$ such that $g_i(\vec{w}) < 0$, then these two quantities are actually equal.
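To make this concrete, here is a minimal sketch, not from the notes, of posing a tiny convex standard form problem and reading off its Lagrange multipliers numerically. It assumes the third-party cvxpy modeling library, and the particular objective and constraints are made up for illustration. Note that there is a $\vec{w}$ with $g(\vec{w}) < 0$ strictly (e.g., $\vec{w} = (0, 0)$ gives $g(\vec{w}) = -1$), so by the fact above the primal and dual optimal values coincide for this problem.

# A minimal sketch (not from the notes): a convex standard form problem
#   min_w  f(w) = ||w - c||^2
#   s.t.   g(w) = w_1 + w_2 - 1 <= 0   (convex inequality constraint)
#          h(w) = w_1 - w_2     =  0   (affine equality constraint)
# posed with the cvxpy modeling library, which also reports the Lagrange
# multipliers (dual variables) for the two constraints after solving.
import cvxpy as cp
import numpy as np

c = np.array([3.0, -1.0])   # arbitrary constant chosen for illustration
w = cp.Variable(2)

objective = cp.Minimize(cp.sum_squares(w - c))   # f(w), convex
ineq = cp.sum(w) - 1 <= 0                        # g(w) <= 0
eq   = w[0] - w[1] == 0                          # h(w) = 0

prob = cp.Problem(objective, [ineq, eq])
prob.solve()

# Expected (up to solver tolerance): w* ~ (0.5, 0.5), f(w*) ~ 8.5,
# and a strictly positive multiplier on the (active) inequality constraint.
print("optimal value f(w*):", prob.value)
print("optimal w*:         ", w.value)
print("alpha* (for g <= 0):", prob.constraints[0].dual_value)
print("beta*  (for h == 0):", prob.constraints[1].dual_value)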
2.1 KKT Conditions

Suppose we have an optimization problem in standard form that is convex, and that for all $i$ there exists some $\vec{w}$ such that $g_i(\vec{w}) < 0$. We know in this case that the optimal value of the primal problem is equal to the optimal value of the dual problem. It turns out that particular values of $\vec{w}^*$, $\vec{\alpha}^*$, and $\vec{\beta}^*$ are solutions to these optimization problems if and only if they satisfy the following conditions.

The KKT Conditions:

• Stationarity:
\[
\frac{\partial}{\partial w_i} L(\vec{w}^*, \vec{\alpha}^*, \vec{\beta}^*) = 0, \quad i = 1, \ldots, n
\]
• Primal Feasibility:
\[
h_i(\vec{w}^*) = 0, \quad i = 1, \ldots, \ell \qquad \text{and} \qquad g_i(\vec{w}^*) \le 0, \quad i = 1, \ldots, k
\]
• Dual Feasibility:
\[
\alpha_i^* \ge 0, \quad i = 1, \ldots, k
\]
• Complementary Slackness:
\[
\alpha_i^* \, g_i(\vec{w}^*) = 0, \quad i = 1, \ldots, k
\]

This final condition will be important later when we talk about the dual of the SVM problem.

3 Maximizing the Margin

Consider a linear threshold function of the form $y = \mathrm{sign}(\vec{w} \cdot \vec{x} + b)$, where $\vec{w}$ is the weight vector and $b$ is the intercept (signifying that the threshold function may not pass through the origin). The (L2) margin of a labeled point $(\vec{x}, y)$ with respect to this threshold is given by
\[
\frac{y \, (\vec{w} \cdot \vec{x} + b)}{\|\vec{w}\|}.
\]
Intuitively, the larger the margin of a point is, the higher our "confidence" that the point belongs to a certain class should be.

Assume we have $m$ separable points $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_m, y_m)$, i.e., $m$ labeled points for which there exists a consistent linear separator. Given these training samples, we would like to find a classifier that maximizes the value of the margin. We can attempt to formulate this problem as a standard form convex optimization problem as follows:
\[
\begin{aligned}
\max_{\gamma, \vec{w}, b} \quad & \gamma \\
\text{s.t.} \quad & \frac{y_i (\vec{w} \cdot \vec{x}_i + b)}{\|\vec{w}\|} \ge \gamma, \quad i = 1, \ldots, m
\end{aligned}
\]
We have a max instead of a min here, but that problem alone would be easy to fix. The bigger problem is the presence of the term $1/\|\vec{w}\|$ in the inequality constraints. This term is not convex, meaning that we could not use convex standard form solvers for this problem. So, let's attempt to reformulate this optimization problem to get rid of this annoying constraint.

We can get this term out of the inequality constraints by defining a new variable $\gamma' = \gamma \|\vec{w}\|$ and rewriting the optimization in terms of this new variable. We have
\[
\begin{aligned}
\max_{\gamma', \vec{w}, b} \quad & \frac{\gamma'}{\|\vec{w}\|} \\
\text{s.t.} \quad & y_i (\vec{w} \cdot \vec{x}_i + b) \ge \gamma', \quad i = 1, \ldots, m
\end{aligned}
\]
The inequality constraints are looking nicer, but we still have ...
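As a small numerical illustration of the margin definition in this section (this example is not part of the notes; the toy dataset and the candidate separator below are made up), the following sketch computes the per-point L2 margin $y_i(\vec{w} \cdot \vec{x}_i + b)/\|\vec{w}\|$ and the margin of the separator, i.e., the worst case over the data. It also checks that scaling $(\vec{w}, b)$ by a positive constant leaves these margins unchanged.

# A minimal sketch (not from the notes): computing per-point L2 margins
# y_i * (w . x_i + b) / ||w|| for a toy separable dataset and a
# hand-picked candidate separator (w, b). All values are illustrative.
import numpy as np

# Toy dataset in R^2 with labels in {-1, +1}; it is linearly separable.
X = np.array([[ 2.0,  2.0],
              [ 3.0,  1.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# A candidate linear threshold function y = sign(w . x + b).
w = np.array([1.0, 1.0])
b = 0.0

margins = y * (X @ w + b) / np.linalg.norm(w)
print("per-point margins:", margins)              # all positive => consistent separator
print("margin of the separator:", margins.min())  # worst case over the data

# Rescaling (w, b) by a positive constant does not change the margins:
print(np.allclose(margins, y * (X @ (10 * w) + 10 * b) / np.linalg.norm(10 * w)))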

