Unformatted text preview:

6 867 Machine learning lecture 4 Jaakkola 1 The Support Vector Machine and regularization We proposed a simple relaxed optimization problem for nding the maximum margin sep arator when some of the examples may be misclassi ed n 1 minimize 2 C t 2 t 1 1 subject to yt T xt 0 1 t and t 0 for all t 1 n 2 where the remaining parameter C could be set by cross validation i e by minimizing the leave one out cross validation error The goal here is to brie y understand the relaxed optimization problem from the point of view of regularization Regularization problems are typically formulated as optimization problems involving the desired objective classi cation loss in our case and a regularization penalty The regularization penalty is used to help stabilize the minimization of the ob jective or infuse prior knowledge we might have about desirable solutions Many machine learning methods can be viewed as regularization methods in this manner For later utility we will cast SVM optimization problem as a regularization problem a 3 3 2 5 2 5 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 3 2 1 0 1 2 3 b 1 3 2 1 0 1 2 3 Figure 1 a The hinge loss 1 z as a function of z b The logistic loss log 1 exp z as a function of z To turn the relaxed optimization problem into a regularization problem we de ne a loss function that corresponds to individually optimized t values and speci es the cost of vio lating each of the margin constraints We are e ectively solving the optimization problem with respect to the values for a xed and 0 This will lead to an expression of C t t as a function of and 0 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 4 Jaakkola 2 The loss function we need for this purpose is based on the hinge loss Lossh z de ned as the positive part of 1 z written as 1 z see Figure 1a The relaxed optimization problem can be written using the hinge loss as 1 minimize 2 C 2 n t 1 yt xt 0 T 3 t 1 Here 2 2 the inverse squared geometric margin is viewed as a regularization penalty that helps stabilize the objective C n 1 yt T xt 0 4 t 1 In other words when no margin constraints are violated zero loss the regularization penalty helps us select the solution with the largest geometric margin Logistic regression maximum likelihood estimation 1 0 9 0 8 0 7 0 6 0 5 0 4 0 3 0 2 0 1 0 3 2 1 0 1 2 3 Figure 2 The logistic function g z 1 exp z 1 Another way of dealing with noisy labels in linear classi cation is to model how the noisy labels are generated For example human assigned labels tend to be very good for typical examples but exhibit some variation in more di cult cases One simple model of noisy labels in linear classi cation is a logistic regression model In this model we assign a Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 4 Jaakkola 3 probability distribution over the two labels in such a way that the labels for examples further away from the decision boundary are more likely to be correct More precisely we say that P y 1 x 0 g T x 0 5 where g z 1 exp z 1 is known as the logistic function Figure 2 One way to derive the form of the logistic function is to say that the log odds of the predicted class probabilities should be a linear function of the inputs log P y 1 x 0 T x 0 P y 1 x 0 6 So for example when we predict the same probability 1 2 for both classes the log odds term is zero and we recover the decision boundary T x 0 0 The precise functional form of the logistic function or equivalently the fact that we chose to model log odds with the linear prediction may seem a little arbitrary but perhaps not more so than the hinge loss used with the SVM classi er We will derive the form of the logistic function later on in the course based on certain assumptions about class conditional distributions P x y 1 and P x y 1 In order to better compare the logistic regression model with the SVM we will write the conditional probability P y x 0 a bit more succinctly Speci cally since 1 g z g z we get P y 1 x 0 1 P y 1 x 0 1 g T x 0 g T x 0 7 and therefore P y x 0 g y T x 0 8 So now we have a linear classi er that makes probabilistic predictions about the labels How should we train such models A sensible criterion would seem to be to maximize the probability that we predict the correct label in response to each example Assuming each example is labeled independently from others this probability of assigning correct labels to examples is given by the product L 0 n P yt xt 0 9 t 1 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 4 Jaakkola 4 L 0 is known as the conditional likelihood function and is interpreted as a function of the parameters for a xed data labels and examples By maximizing this conditional likelihood with respect to and 0 we obtain maximum likelihood estimates of the param eters Maximum likelihood estimators 1 have many nice properties For example assuming we have selected the right model class logistic regression model and certain regularity conditions hold then the ML estimator is a consistent we will get the right parameter values in the limit of a large number of training examples and b e cient no other esti mator will converge to the correct parameter values faster in the mean squared sense But what if we do not have the right model class Neither property may hold as a result More robust estimators can be found in a larger class of estimators called M estimators that includes maximum likelihood We will nevertheless use the maximum likelihood principle to set the parameter values The product form of the conditional likelihood function is a bit di cult to work with directly so we will maximize its logarithm instead l 0 n log P yt xt 0 10 t 1 Alternatively we can minimize the negative logarithm log loss n …


View Full Document

MIT 6 867 - The Support Vector Machine and regularization

Loading Unlocking...
Login

Join to view The Support Vector Machine and regularization and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view The Support Vector Machine and regularization and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?