# ILLINOIS CS 446 - 100517.2 (7 pages)


- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


## Lecture 12: Kernel Ridge Regression & Support Vector Machine

CS446 Machine Learning, Fall 2017. Lecturer: Sanmi Koyejo. Scribe: Lan Wang. Oct 5th, 2017.

### Agenda

- Recap: kernels
- Kernel ridge regression
- Support vector machines

### Recall: kernel function

We define a kernel function to be a real-valued function of two arguments, used to measure similarity between any pair of data inputs $x_i, x_j$:

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

Here $\phi: \mathbb{R}^d \to \mathbb{R}^n$ is the feature mapping, where $d$ is the dimension of each data input and $n$ is the dimension of the feature space; usually $d \le n$.

### Mercer kernel [2]

A Mercer kernel is a kernel function satisfying the requirement that the Gram matrix

$$K = \begin{pmatrix} \kappa(x_1, x_1) & \cdots & \kappa(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \kappa(x_n, x_1) & \cdots & \kappa(x_n, x_n) \end{pmatrix}$$

is positive semi-definite for any set of data inputs $\{x_i\}_{i=1}^n$.

### Gaussian kernel

The Gaussian kernel, also called the RBF (radial basis function) kernel, is a popular kernel function defined as

$$\kappa(x_i, x_j) = e^{-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}} = e^{-\gamma \|x_i - x_j\|_2^2},$$

where $\gamma = \frac{1}{2\sigma^2}$. Notice that by Taylor expansion we have (in the scalar case)

$$e^{2\gamma x x'} = \sum_{k=0}^{\infty} \frac{(2\gamma)^k x^k (x')^k}{k!},$$

so we can rewrite $\kappa(x_i, x_j)$ as

$$\kappa(x_i, x_j) = e^{-\gamma (x_i - x_j)^2} = \sum_{k=0}^{\infty} \frac{(2\gamma)^k x_i^k x_j^k}{k!} \, e^{-\gamma x_i^2} e^{-\gamma x_j^2}. \tag{1}$$

This implies that we have infinite-dimensional feature maps

$$\phi_k(x_i) = \sqrt{\frac{(2\gamma)^k}{k!}} \, x_i^k \, e^{-\gamma x_i^2},$$

where each $\phi_k$ is a function of $x_i$ derived from (1), so that $\kappa(x_i, x_j) = \sum_{k=0}^{\infty} \phi_k(x_i) \, \phi_k(x_j)$.

### Ridge regression

Ridge regression is a technique for analyzing multiple-regression data that suffer from multicollinearity, i.e., near-linear relationships among the independent variables. To understand ridge regression, let's first recall the regularized linear regression model that we discussed before:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \|\beta\|_2^2. \tag{2}$$

Let $X_{n \times d} = (x_1, \dots, x_n)^T$; then the optimal solution is

$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y.$$

Using the fact that $(\lambda I + A B)^{-1} A = A (\lambda I + B A)^{-1}$, we can rewrite $\hat{\beta}$ as

$$\hat{\beta} = X^T (\lambda I + X X^T)^{-1} y.$$

Now we have two ways to get the optimal solution:

- $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$, computational cost $O(d^3)$;
- $\hat{\beta} = X^T (X X^T + \lambda I)^{-1} y$, computational cost $O(n^3)$.

**Remark.** Clearly, we can see from here that $\hat{\beta}$ is in the row space of $X$. When $d < n$, using the first method is more efficient.
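Equation (1) can be checked numerically: truncating the infinite sum at a modest number of terms already reproduces the Gaussian kernel value. A minimal scalar sketch (the values of `gamma`, `x`, `x_prime` and the helper `phi` are illustrative choices, not from the notes):

```python
import math

# Scalar Gaussian kernel kappa(x, x') = exp(-gamma * (x - x')**2) and its
# expansion (1) as an inner product of feature-map coordinates
# phi_k(z) = sqrt((2*gamma)**k / k!) * z**k * exp(-gamma * z**2).
gamma = 0.5
x, x_prime = 0.3, -0.7

exact = math.exp(-gamma * (x - x_prime) ** 2)

def phi(z, k):
    return math.sqrt((2 * gamma) ** k / math.factorial(k)) * z ** k * math.exp(-gamma * z ** 2)

# Truncating the infinite sum at 20 terms already matches the kernel value.
approx = sum(phi(x, k) * phi(x_prime, k) for k in range(20))

print(abs(approx - exact) < 1e-9)  # True
```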
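The two closed forms for $\hat{\beta}$ above should agree numerically, and the dual form makes predictions depend on the data only through inner products $x^T x_i$, which is what makes kernelization possible. A minimal sketch with random data (the dimensions and $\lambda$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1                      # n samples, d features, penalty lambda
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal form: inverts a d x d matrix, cost O(d^3).
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual form: inverts an n x n matrix, cost O(n^3).
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
beta_dual = X.T @ alpha

print(np.allclose(beta_primal, beta_dual))   # True

# Prediction via the dual touches the data only through inner products.
x_new = rng.standard_normal(d)
f_primal = x_new @ beta_primal
f_dual = (X @ x_new) @ alpha                 # sum_i (x_new^T x_i) alpha_i
print(np.isclose(f_primal, f_dual))          # True
```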
Based on the above analysis, we have

$$f(x_{n+1}) = x_{n+1}^T \hat{\beta} = x_{n+1}^T X^T (X X^T + \lambda I)^{-1} y = \sum_{i=1}^{n} x_{n+1}^T x_i \, \alpha_i,$$

where $\alpha = (X X^T + \lambda I)^{-1} y$.

Now let's consider ridge regression using feature mappings:

$$\min_{\theta} \sum_{i=1}^{n} (y_i - \theta^T \phi(x_i))^2 + \lambda \|\theta\|_2^2.$$

Let $\Phi_{n \times m} = (\phi(x_1), \dots, \phi(x_n))^T$; then, similarly, we have two ways to get the optimal solution:

$$\hat{\theta} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} y,$$

so that

$$f(x_{n+1}) = \hat{\theta}^T \phi(x_{n+1}) = \sum_{i=1}^{n} \phi(x_{n+1})^T \phi(x_i) \, \alpha_i.$$

Define a kernel matrix as follows:

$$K = \Phi \Phi^T = \begin{pmatrix} \phi(x_1)^T \phi(x_1) & \cdots & \phi(x_1)^T \phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi(x_n)^T \phi(x_1) & \cdots & \phi(x_n)^T \phi(x_n) \end{pmatrix}.$$

Then, writing $\theta = \Phi^T \alpha$, we can rewrite $\|\theta\|_2^2$ and $f$ as

$$\|\theta\|_2^2 = \alpha^T \Phi \Phi^T \alpha = \alpha^T K \alpha, \qquad f(x) = \kappa(x)^T \alpha,$$

where

$$\kappa(x) = \big( \phi(x)^T \phi(x_1), \dots, \phi(x)^T \phi(x_n) \big)^T = \big( \kappa(x, x_1), \dots, \kappa(x, x_n) \big)^T.$$

Now we are ready to write the original least-squares problem (2) as the following equivalent problem:

$$\min_{\alpha} \|y - K \alpha\|_2^2 + \lambda \, \alpha^T K \alpha, \tag{3}$$

and the optimal solution is

$$\hat{\alpha} = (K + \lambda I)^{-1} y. \tag{4}$$

When $\phi(x) \in \mathbb{R}^m$ and $m \gg n$, problem (3) is more efficient.

**Remark.** One reason that the kernel is useful is that we can use only it to construct complicated models, like the neural network.

*Figure 1: 1-layer neural network.*

### Support Vector Machines (SVMs)

A support vector machine is a discriminative classifier defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new data [1]. Now the question is: in which sense is the hyperplane obtained optimal? Intuitively, we want the hyperplane to separate points with different labels as much as possible, so a reasonable definition of an optimal hyperplane is the following.

**Optimal hyperplane.** A hyperplane that maximizes the margin, defined by the distance from the closest training data point to the decision boundary.

*Figure 2: linearly separable training data [1].*

Now let's define a hyperplane formally by a function $f$:

$$f(x) = \beta^T x + \beta_0.$$

The optimal hyperplane can be represented in an infinite number of different ways by scaling $\beta$ and $\beta_0$. As a matter of convention, among all possible representations of the hyperplane, the one chosen is

$$|\beta^T x + \beta_0| = 1,$$

where $x$ symbolizes the training data closest to the hyperplane.
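The canonical representation can always be reached by rescaling: divide $(\beta, \beta_0)$ by the smallest value of $|\beta^T x_i + \beta_0|$ over the training data. A small sketch on made-up 2-D points (the data and the starting $(\beta, \beta_0)$ are illustrative assumptions):

```python
import numpy as np

# Three labeled points; the separating hyperplane is the line x[0] = 0.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 2.0]])

beta, beta_0 = np.array([2.0, 0.0]), 0.0     # one representation of x[0] = 0
scale = np.abs(X @ beta + beta_0).min()      # |f| at the closest training point
beta, beta_0 = beta / scale, beta_0 / scale  # canonical representation

print(beta.tolist(), beta_0)                 # [1.0, 0.0] 0.0
print(np.abs(X @ beta + beta_0).min())       # 1.0
```

After rescaling, the closest training points satisfy $|\beta^T x + \beta_0| = 1$ exactly, which is the convention used below.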
So we have a linear classifier:

$$f(x) \ge 1 \ \text{if } y = 1, \qquad f(x) \le -1 \ \text{if } y = -1.$$

By the distance formula, we can compute the distance from any data point $x$ that satisfies $\beta^T x + \beta_0 = 1$ to the hyperplane:

$$d_+ = \frac{|\beta^T x + \beta_0|}{\|\beta\|_2} = \frac{1}{\|\beta\|_2}.$$

Similarly, for any data point $x$ that satisfies $\beta^T x + \beta_0 = -1$, the distance is $d_- = \frac{1}{\|\beta\|_2}$. So we have

$$\text{margin} = d_+ + d_- = \frac{2}{\|\beta\|_2}.$$

To find the optimal hyperplane, we need to solve the following optimization problem:

$$\max_{\beta, \beta_0} \frac{2}{\|\beta\|_2} \quad \text{such that} \quad y_i (\beta^T x_i + \beta_0) - 1 \ge 0 \ \text{for every } i.$$

An equivalent problem is

$$\min_{\beta, \beta_0} \|\beta\|_2^2 \quad \text{such that} \quad y_i f(x_i) - 1 \ge 0 \ \text{for every } i.$$

Now let's solve this problem using the method of Lagrange multipliers:

$$\min_{\beta, \beta_0} \ \frac{1}{2} \|\beta\|_2^2 - \sum_{i=1}^{n} \alpha_i \big[ y_i f(x_i) - 1 \big].$$

The solution of this problem is

$$\beta = \sum_i \alpha_i y_i x_i,$$

where $\alpha_i = 0$ if $y_i f(x_i) > 1$ and $\alpha_i > 0$ if $y_i f(x_i) = 1$. In general, the training data that are closest to the hyperplane, i.e., the vectors in the set $\{x_i : y_i f(x_i) = 1\}$, are called support vectors. By this definition, we can rewrite the solution as

$$\beta = \sum_{\text{support vectors}} \alpha_i y_i x_i.$$

From this, it is clear that if we remove any points except for the support vectors, the solution will not change.

For prediction, we can write $f(x)$ as the following:

$$f(x) = \beta^T x = \sum_{i=1}^{n} \alpha_i y_i x_i^T x.$$

Similarly, for the nonlinear case,

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)^T \phi(x) = \sum_{i=1}^{n} \alpha_i y_i \kappa(x_i, x).$$

Since most of the $\alpha_i$'s are 0, we only need to consider the support vectors.

### Bibliography

[1] Introduction to Support Vector Machines. [ONLINE] http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html, 2017.

[2] K. Murphy. *Machine Learning: A Probabilistic Perspective*. MIT Press, 2012.
