Unformatted text preview:

Machine learning lecture 8 Tommi S Jaakkola MIT CSAIL tommi csail mit edu Topics Support vector machines training prediction other kernel methods Kernels examples properties construction feature vectors and sparsity Tommi Jaakkola MIT CSAIL 2 SVM summary Training We can find the optimal setting of the Lagrange multipliers i by maximizing n X n X 1 i yiyj i j K xi xj 2 i 1 i j 1 subject to 0 i C and P i iyi 0 larger C means larger penalty for errors i 0 except for support vectors all misclassified examples will be support vectors w 0 can be found based on examples for which i is between 0 and C when classification constraints are satisfied with equality Tommi Jaakkola MIT CSAIL 3 SVM summary Training We can find the optimal setting of the Lagrange multipliers i by maximizing n X n X 1 i yiyj i j K xi xj 2 i 1 i j 1 subject to 0 i C and P i iyi 0 Prediction We make predictions according to the sign of the discriminant function X y sign w 0 iyiK x xi i SV Tommi Jaakkola MIT CSAIL 4 Other kernel methods linear regression A linear regression model with feature vectors f x w x T w1 w0 where x 1 x m x T We can train these models via regularized least squares 1X 2 yi f x w kw1k2 min 2 i 1 2 We d like to turn these models into kernel methods where the examples feature vectors appear only in inner products K x x0 T x x0 Tommi Jaakkola MIT CSAIL 5 Other kernel methods linear regression Training maximize n X i 1 n X 1 iyi i2 2 i j K xi xj 2 i j 1 subject to i R and P i i 0 The offset parameter w 0 can be obtained directly from the solution n n X 1X w 0 j K xi xj yi n i 1 j 1 Tommi Jaakkola MIT CSAIL 6 Other kernel methods linear regression Training maximize n n X X 1 2 i j K xi xj iyi i 2 2 i j 1 i 1 P subject to i R and i i 0 The offset parameter w 0 can be obtained directly from the solution n n X 1X yi j K xi xj w 0 n i 1 j 1 Prediction the predicted output for a new point x is n X f x w 0 w 0 iK x xi i 1 Tommi Jaakkola MIT CSAIL 7 Other kernel methods logistic regression A logistic regression model with feature vectors T P y 1 x w g x w1 w0 where x 1 x m x T As before we can train these models by minimizing the following regularized empirical loss maximizing penalized log likelihood min Tommi Jaakkola MIT CSAIL n X log P yi xi w kw1k2 2 i 1 8 Other kernel methods logistic regression Training maximize n X n 1 X H i yiyj i j K xi xj 2 i 1 i j 1 subject to 0 i 1 and P i iyi 0 Here H p p log p 1 p log 1 p is the binary entropy function w 0 has to be solved iteratively after obtaining Tommi Jaakkola MIT CSAIL 9 Other kernel methods logistic regression Training maximize n X n X 1 H i yiyj i j K xi xj 2 i j 1 i 1 P subject to 0 i 1 and i iyi 0 Here H p p log p 1 p log 1 p is the binary entropy function w 0 has to be solved iteratively after obtaining Prediction the predicted probabilities over possible labels for a new point x are given by P y 1 x w 0 g w 0 n X iyiK x xi i 1 Tommi Jaakkola MIT CSAIL 10 Example kernels Linear kernel K x x0 xT x0 Polynomial kernel p K x x 1 x x 0 T 0 where p 2 3 To get the feature vectors we concatenate all up to pth order polynomial terms of the components of x weighted appropriately Tommi Jaakkola MIT CSAIL 11 Polynomial kernels with SVMs 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 1 5 1 0 5 0 0 5 1 1 5 2 1 1 5 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 0 5 0 0 5 1 1 5 4th order polynomial Tommi Jaakkola MIT CSAIL 0 5 0 0 5 1 1 5 2 2nd order polynomial linear 1 1 5 1 2 1 1 5 1 0 5 0 0 5 1 1 5 2 8th order polynomial 12 Example kernels Radial basis kernel 1 K x x exp kx x0k2 2 In this case the feature space is infinite dimensional function space use of the kernel results in a non parametric classifier 0 5 0 71 0 77 4 1 24 0 18 3 0 74 0 66 2 0 44 0 67 1 26 1 8 81 0 67 5 75 3 78 0 1 0 41 0 35 2 3 3 0 67 2 1 0 1 2 3 4 5 support vectors need not appear close to the boundary in the input space only in the feature space Tommi Jaakkola MIT CSAIL 13 Definition of kernels We can think of kernels in terms of explicit or implicit feature mappings Definition 1 K x x0 is a kernel if it can be written as an inner product x T x0 for some feature mapping Definition 2 K x x0 is a kernel if for any finite set of training examples x1 xn the n n matrix Kij K xi xj is positive semi definite Tommi Jaakkola MIT CSAIL 14 Kernels and construction We can build kernels from simpler ones For example If K1 x x0 and K2 x x0 are valid kernels then f x K1 x x0 f x0 K1 x x0 K2 x x0 K1 x x0 K2 x x0 scaling sum product are valid kernels If x x1 xd T Rd and Ki xi x0i are valid 1dimensional kernels then K x x0 d Y Ki xi x0i i 1 is a valid kernel in Rd Tommi Jaakkola MIT CSAIL 15 Kernels and sequences We can also derive kernels for variable length sequences For example x my first day this term was x0 Last year the midterm had Gap weighted subsequence kernel X X X 0 id i1 jd j1 K x x u d i u x i j u x j where 0 1 and d is the set of all sequences of length d The kernel reflects the degree to which the sequences have common subsequences penalizing noncontiguous subsequences Tommi Jaakkola MIT CSAIL 16 Dimensionality and complexity Many of these kernels correspond to very high dimensional feature spaces polynomial kernel for large p or dim x radial basis kernel …


View Full Document

MIT 6 867 - Lecture Notes

Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?