6.867 Machine learning, lecture 4 (Jaakkola)

The Support Vector Machine and regularization

We proposed a simple relaxed optimization problem for finding the maximum margin separator when some of the examples may be misclassified:

    minimize    \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \xi_t                                              (1)

    subject to  y_t(\theta^T x_t + \theta_0) \ge 1 - \xi_t  and  \xi_t \ge 0  for all  t = 1, \ldots, n       (2)

where the remaining parameter C could be set by cross-validation, i.e., by minimizing the leave-one-out cross-validation error.

The goal here is to briefly understand the relaxed optimization problem from the point of view of regularization. Regularization problems are typically formulated as optimization problems involving the desired objective (classification loss in our case) and a regularization penalty. The regularization penalty is used to help stabilize the minimization of the objective or to infuse prior knowledge we might have about desirable solutions. Many machine learning methods can be viewed as regularization methods in this manner. For later utility we will cast the SVM optimization problem as a regularization problem.

Figure 1: a) The hinge loss (1 - z)^+ as a function of z. b) The logistic loss log[1 + exp(-z)] as a function of z.

To turn the relaxed optimization problem into a regularization problem we define a loss function that corresponds to individually optimized ξ_t values and specifies the cost of violating each of the margin constraints. We are effectively solving the optimization problem with respect to the ξ_t values for a fixed θ and θ_0. This will lead to an expression of C Σ_t ξ_t as a function of θ and θ_0.

The loss function we need for this purpose is based on the hinge loss Loss_h(z), defined as the positive part of 1 - z, written as (1 - z)^+ (see Figure 1a). The relaxed optimization problem can be written using the hinge loss as

    minimize    \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \left( 1 - y_t(\theta^T x_t + \theta_0) \right)^{+}      (3)

where each term in the sum equals the individually optimized slack \hat{\xi}_t. Here \|\theta\|^2/2, the inverse squared geometric margin, is viewed as a regularization penalty that helps stabilize the objective

    C \sum_{t=1}^{n} \left( 1 - y_t(\theta^T x_t + \theta_0) \right)^{+}                                            (4)

In other words, when no margin constraints are violated (zero loss), the regularization penalty helps us select the solution with the largest geometric margin.
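To make the regularization form (3) concrete, here is a small numpy sketch that evaluates that objective; it is not part of the original notes, and the function name, the toy data, and the choice C = 1 are illustrative assumptions.

import numpy as np

def svm_objective(theta, theta0, X, y, C):
    # Relaxed SVM objective in its regularization form, eq. (3):
    #   (1/2) ||theta||^2 + C * sum_t (1 - y_t (theta^T x_t + theta_0))^+
    margins = y * (X @ theta + theta0)        # y_t (theta^T x_t + theta_0)
    slacks = np.maximum(0.0, 1.0 - margins)   # individually optimized xi_t (hinge loss)
    return 0.5 * (theta @ theta) + C * slacks.sum()

# Toy check (illustrative data): when every margin is at least 1 the loss term
# vanishes and only the regularization penalty ||theta||^2 / 2 remains.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
theta, theta0 = np.array([1.0, 0.0]), 0.0
print(svm_objective(theta, theta0, X, y, C=1.0))  # prints 0.5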
Logistic regression, maximum likelihood estimation

Figure 2: The logistic function g(z) = (1 + exp(-z))^{-1}.

Another way of dealing with noisy labels in linear classification is to model how the noisy labels are generated. For example, human-assigned labels tend to be very good for "typical examples" but exhibit some variation in more difficult cases. One simple model of noisy labels in linear classification is a logistic regression model. In this model we assign a probability distribution over the two labels in such a way that the labels for examples further away from the decision boundary are more likely to be correct. More precisely, we say that

    P(y = 1 | x, \theta, \theta_0) = g(\theta^T x + \theta_0)                                                       (5)

where g(z) = (1 + exp(-z))^{-1} is known as the logistic function (Figure 2). One way to derive the form of the logistic function is to say that the log-odds of the predicted class probabilities should be a linear function of the inputs:

    \log \frac{P(y = 1 | x, \theta, \theta_0)}{P(y = -1 | x, \theta, \theta_0)} = \theta^T x + \theta_0             (6)

So, for example, when we predict the same probability (1/2) for both classes, the log-odds term is zero and we recover the decision boundary \theta^T x + \theta_0 = 0. The precise functional form of the logistic function, or, equivalently, the fact that we chose to model the log-odds with a linear prediction, may seem a little arbitrary (but perhaps not more so than the hinge loss used with the SVM classifier). We will derive the form of the logistic function later on in the course based on certain assumptions about the class-conditional distributions P(x|y = 1) and P(x|y = -1).

In order to better compare the logistic regression model with the SVM we will write the conditional probability P(y|x, θ, θ_0) a bit more succinctly. Specifically, since 1 - g(z) = g(-z), we get

    P(y = -1 | x, \theta, \theta_0) = 1 - P(y = 1 | x, \theta, \theta_0) = 1 - g(\theta^T x + \theta_0) = g(-(\theta^T x + \theta_0))      (7)

and therefore

    P(y | x, \theta, \theta_0) = g(y(\theta^T x + \theta_0))                                                        (8)

So now we have a linear classifier that makes probabilistic predictions about the labels. How should we train such models? A sensible criterion would seem to be to maximize the probability that we predict the correct label in response to each example. Assuming each example is labeled independently of the others, this probability of assigning correct labels to the examples is given by the product

    L(\theta, \theta_0) = \prod_{t=1}^{n} P(y_t | x_t, \theta, \theta_0)                                            (9)

L(θ, θ_0) is known as the (conditional) likelihood function and is interpreted as a function of the parameters for fixed data (labels and examples). By maximizing this conditional likelihood with respect to θ and θ_0 we obtain maximum likelihood estimates of the parameters. Maximum likelihood estimators have many nice properties. For example, assuming we have selected the right model class (the logistic regression model) and certain regularity conditions hold, the ML estimator is a) consistent (we will get the right parameter values in the limit of a large number of training examples), and b) efficient (no other estimator will converge to the correct parameter values faster in the mean squared sense). But what if we do not have the right model class? Neither property may hold.
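As a companion sketch (again, not part of the original notes), the conditional log-likelihood \log L(θ, θ_0) implied by (8)-(9) and its gradient take only a few lines of numpy; the helper names, synthetic data, step size, and iteration count below are illustrative assumptions.

import numpy as np

def log_likelihood(theta, theta0, X, y):
    # log L(theta, theta_0) = sum_t log g(y_t (theta^T x_t + theta_0)), from eqs. (8)-(9);
    # log g(z) = -log(1 + exp(-z)), computed with logaddexp for numerical stability.
    z = y * (X @ theta + theta0)
    return -np.logaddexp(0.0, -z).sum()

def gradient(theta, theta0, X, y):
    # Each example contributes (1 - P(y_t | x_t)) * y_t * x_t to the theta-gradient
    # and (1 - P(y_t | x_t)) * y_t to the theta_0-gradient.
    p_correct = 1.0 / (1.0 + np.exp(-y * (X @ theta + theta0)))  # g(y_t (theta^T x_t + theta_0))
    w = (1.0 - p_correct) * y
    return X.T @ w, w.sum()

# A few steps of gradient ascent on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.0, 1.0, -1.0)
theta, theta0, step = np.zeros(2), 0.0, 0.1
for _ in range(200):
    g_theta, g_theta0 = gradient(theta, theta0, X, y)
    theta, theta0 = theta + step * g_theta, theta0 + step * g_theta0
print(log_likelihood(theta, theta0, X, y))  # climbs toward 0 as the fit improves

Because the toy data here happen to be linearly separable, the likelihood can be driven arbitrarily close to one by scaling the parameters up, which is why the sketch stops after a fixed number of steps.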


Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.