6.867 Machine learning, lecture 4 (Jaakkola)

The Support Vector Machine and regularization

We proposed a simple relaxed optimization problem for finding the maximum margin separator when some of the examples may be misclassified:

    minimize    \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \xi_t                                              (1)

    subject to  y_t(\theta^T x_t + \theta_0) \ge 1 - \xi_t  and  \xi_t \ge 0  for all  t = 1, \ldots, n       (2)

where the remaining parameter C could be set by cross-validation, i.e., by minimizing the leave-one-out cross-validation error.

The goal here is to briefly understand the relaxed optimization problem from the point of view of regularization. Regularization problems are typically formulated as optimization problems involving the desired objective (classification loss in our case) and a regularization penalty. The regularization penalty is used to help stabilize the minimization of the objective or to infuse prior knowledge we might have about desirable solutions. Many machine learning methods can be viewed as regularization methods in this manner. For later utility we will cast the SVM optimization problem as a regularization problem.

Figure 1: a) The hinge loss (1 - z)^+ as a function of z. b) The logistic loss log[1 + exp(-z)] as a function of z.

To turn the relaxed optimization problem into a regularization problem we define a loss function that corresponds to individually optimized ξ_t values and specifies the cost of violating each of the margin constraints. We are effectively solving the optimization problem with respect to the ξ_t values for a fixed θ and θ_0. This will lead to an expression of C Σ_t ξ_t as a function of θ and θ_0.

The loss function we need for this purpose is based on the hinge loss Loss_h(z), defined as the positive part of 1 - z, written as (1 - z)^+ (see Figure 1a). The relaxed optimization problem can be written using the hinge loss as

    minimize    \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \left( 1 - y_t(\theta^T x_t + \theta_0) \right)^{+}      (3)

where each term in the sum equals the individually optimized slack \hat{\xi}_t. Here \|\theta\|^2/2, the inverse squared geometric margin, is viewed as a regularization penalty that helps stabilize the objective

    C \sum_{t=1}^{n} \left( 1 - y_t(\theta^T x_t + \theta_0) \right)^{+}                                            (4)

In other words, when no margin constraints are violated (zero loss), the regularization penalty helps us select the solution with the largest geometric margin.
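To make the regularization form (3) concrete, here is a small numpy sketch that evaluates that objective; it is not part of the original notes, and the function name, the toy data, and the choice C = 1 are illustrative assumptions.

import numpy as np

def svm_objective(theta, theta0, X, y, C):
    # Relaxed SVM objective in its regularization form, eq. (3):
    #   (1/2) ||theta||^2 + C * sum_t (1 - y_t (theta^T x_t + theta_0))^+
    margins = y * (X @ theta + theta0)        # y_t (theta^T x_t + theta_0)
    slacks = np.maximum(0.0, 1.0 - margins)   # individually optimized xi_t (hinge loss)
    return 0.5 * (theta @ theta) + C * slacks.sum()

# Toy check (illustrative data): when every margin is at least 1 the loss term
# vanishes and only the regularization penalty ||theta||^2 / 2 remains.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
theta, theta0 = np.array([1.0, 0.0]), 0.0
print(svm_objective(theta, theta0, X, y, C=1.0))  # prints 0.5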
Logistic regression, maximum likelihood estimation

Figure 2: The logistic function g(z) = (1 + exp(-z))^{-1}.

Another way of dealing with noisy labels in linear classification is to model how the noisy labels are generated. For example, human-assigned labels tend to be very good for "typical examples" but exhibit some variation in more difficult cases. One simple model of noisy labels in linear classification is a logistic regression model. In this model we assign a probability distribution over the two labels in such a way that the labels for examples further away from the decision boundary are more likely to be correct. More precisely, we say that

    P(y = 1 | x, \theta, \theta_0) = g(\theta^T x + \theta_0)                                                       (5)

where g(z) = (1 + exp(-z))^{-1} is known as the logistic function (Figure 2). One way to derive the form of the logistic function is to say that the log-odds of the predicted class probabilities should be a linear function of the inputs:

    \log \frac{P(y = 1 | x, \theta, \theta_0)}{P(y = -1 | x, \theta, \theta_0)} = \theta^T x + \theta_0             (6)

So, for example, when we predict the same probability (1/2) for both classes, the log-odds term is zero and we recover the decision boundary \theta^T x + \theta_0 = 0. The precise functional form of the logistic function, or, equivalently, the fact that we chose to model the log-odds with a linear prediction, may seem a little arbitrary (but perhaps not more so than the hinge loss used with the SVM classifier). We will derive the form of the logistic function later on in the course based on certain assumptions about the class-conditional distributions P(x|y = 1) and P(x|y = -1).

In order to better compare the logistic regression model with the SVM we will write the conditional probability P(y|x, θ, θ_0) a bit more succinctly. Specifically, since 1 - g(z) = g(-z), we get

    P(y = -1 | x, \theta, \theta_0) = 1 - P(y = 1 | x, \theta, \theta_0) = 1 - g(\theta^T x + \theta_0) = g(-(\theta^T x + \theta_0))      (7)

and therefore

    P(y | x, \theta, \theta_0) = g(y(\theta^T x + \theta_0))                                                        (8)

So now we have a linear classifier that makes probabilistic predictions about the labels. How should we train such models? A sensible criterion would seem to be to maximize the probability that we predict the correct label in response to each example. Assuming each example is labeled independently of the others, this probability of assigning correct labels to the examples is given by the product

    L(\theta, \theta_0) = \prod_{t=1}^{n} P(y_t | x_t, \theta, \theta_0)                                            (9)

L(θ, θ_0) is known as the (conditional) likelihood function and is interpreted as a function of the parameters for fixed data (labels and examples). By maximizing this conditional likelihood with respect to θ and θ_0 we obtain maximum likelihood estimates of the parameters. Maximum likelihood estimators have many nice properties. For example, assuming we have selected the right model class (the logistic regression model) and certain regularity conditions hold, the ML estimator is a) consistent (we will get the right parameter values in the limit of a large number of training examples), and b) efficient (no other estimator will converge to the correct parameter values faster in the mean squared sense). But what if we do not have the right model class? Neither property may hold.
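As a companion sketch (again, not part of the original notes), the conditional log-likelihood \log L(θ, θ_0) implied by (8)-(9) and its gradient take only a few lines of numpy; the helper names, synthetic data, step size, and iteration count below are illustrative assumptions.

import numpy as np

def log_likelihood(theta, theta0, X, y):
    # log L(theta, theta_0) = sum_t log g(y_t (theta^T x_t + theta_0)), from eqs. (8)-(9);
    # log g(z) = -log(1 + exp(-z)), computed with logaddexp for numerical stability.
    z = y * (X @ theta + theta0)
    return -np.logaddexp(0.0, -z).sum()

def gradient(theta, theta0, X, y):
    # Each example contributes (1 - P(y_t | x_t)) * y_t * x_t to the theta-gradient
    # and (1 - P(y_t | x_t)) * y_t to the theta_0-gradient.
    p_correct = 1.0 / (1.0 + np.exp(-y * (X @ theta + theta0)))  # g(y_t (theta^T x_t + theta_0))
    w = (1.0 - p_correct) * y
    return X.T @ w, w.sum()

# A few steps of gradient ascent on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.0, 1.0, -1.0)
theta, theta0, step = np.zeros(2), 0.0, 0.1
for _ in range(200):
    g_theta, g_theta0 = gradient(theta, theta0, X, y)
    theta, theta0 = theta + step * g_theta, theta0 + step * g_theta0
print(log_likelihood(theta, theta0, X, y))  # climbs toward 0 as the fit improves

Because the toy data here happen to be linearly separable, the likelihood can be driven arbitrarily close to one by scaling the parameters up, which is why the sketch stops after a fixed number of steps.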


Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.