# ILLINOIS CS 446 - 101717.1 (11 pages)


- Pages: 11
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017

## Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron

Lecturer: Sanmi Koyejo. Scribe: Ke Wang, Oct 24th, 2017.

### Agenda

- Recap: SVM and hinge loss; representer theorem
- Stochastic gradient descent
- Perceptron, other examples

### Kernels

Recall the feature transformation. For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the prediction function is $f: \mathcal{X} \to \mathcal{Y}$. Let $\phi$ be the feature transformation function, giving a mapping $x \mapsto \phi(x)$. After the transformation we have

$$f(x) = w^T \phi(x) + w_0.$$

Note that here $f(x)$ is a linear function with respect to $\phi(x)$, and a non-linear function with respect to $x$, under the condition that $\phi(x)$ is a non-linear function.

Define the kernel function as

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

This is especially useful when $\phi(x)$ is infinite-dimensional. The motivation for an infinite-dimensional $\phi(x)$ is that, in theory, you can almost always make your data points linearly separable by bending the data space into an infinite-dimensional space. See Figure 1.

Figure 1: Data lifted to a higher dimension becomes linearly separable (from Kim, 2013).

When $\phi(x)$ has infinitely many dimensions, solving for $w$ in $w^T \phi(x)$ is hard, since $w$ is also infinite-dimensional. But since the value of $f(x)$ usually appears in the form $\phi(x_i)^T \phi(x_j)$, we can use the kernel function to represent $f(x)$ without knowing $\phi$ and $w$ explicitly. For a given data point,

$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$

We have shown two applications of kernels:

- Linear regression, using a linear algebra identity
- SVM, using the method of Lagrange multipliers

### Hilbert Spaces

$\mathcal{H}_K$ is the space of well-behaved linear functions like $w^T \phi(x)$. This is equivalent to the space of bounded functions

$$f(x) = \sum_{i} \alpha_i K(x, x_i), \qquad K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

### Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to $1/\|w\|_2$:

$$\min_w \frac{\|w\|_2^2}{2} \quad \text{s.t.} \quad y_i f(x_i) \ge 1,\ \forall i \in [n],$$

where $f(x) = w^T \phi(x)$.

Why set the threshold to 1? Suppose we change it to some constant $C$:

$$y_i f(x_i) \ge C \iff \frac{y_i f(x_i)}{C} \ge 1 \iff y_i \left(\frac{w}{C}\right)^T \phi(x_i) \ge 1.$$

This won't change the solution to our optimization problem, beyond rescaling $w$:

$$\min_w \|w\|_2^2 \quad \text{s.t.} \quad y_i f(x_i) \ge 1,\ i \in [n].$$
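The rescaling argument can be checked numerically. A minimal sketch in plain Python, where the data points, labels, and $w$ are made up for illustration: if $w$ satisfies $y_i w^T x_i \ge C$, then $w/C$ satisfies the constraint with threshold 1 and makes exactly the same predictions.

```python
# Toy check (hypothetical data): rescaling the margin threshold from C to 1
# only rescales w; the sign of the prediction is unchanged.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [(2.0, 1.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]  # toy 2-D points
y = [1, 1, -1, -1]                                        # labels in {-1, +1}

C = 2.5
w = (1.0, 1.0)  # chosen so that y_i * w^T x_i >= C for every point

w_scaled = tuple(wi / C for wi in w)  # w' = w / C

for xi, yi in zip(X, y):
    assert yi * dot(w, xi) >= C          # original constraint with threshold C
    assert yi * dot(w_scaled, xi) >= 1   # rescaled w meets threshold 1
    # the classifier itself is unchanged:
    assert (dot(w, xi) > 0) == (dot(w_scaled, xi) > 0)

print("rescaling w by 1/C preserves the constraints and the predictions")
```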
So we can just set $C = 1$ without losing generality.

### Non-separable Data

Recall that with the slack variables $\xi_i$ we have

$$\min_{w,\xi} \|w\|_2^2 + \lambda \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \xi_i \ge 0,\ y_i f(x_i) \ge 1 - \xi_i,\ \forall i \in [n].$$

Figure 2: SVM with slack variables; data points marked with circles are on the margin.

Intuitively, $\xi_i$ represents how much tolerance to misclassification the model has. See Figure 2. With $f(x) = w^T \phi(x)$, the solution takes the form

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i) = \sum_{i=1}^{n} e_i \phi(x_i), \qquad \text{where } \alpha_i \ge 0 \text{ and } e_i = \alpha_i y_i.$$
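In keeping with the lecture's stochastic-gradient-descent theme, the soft-margin objective can be minimized with SGD on the hinge loss. Below is a minimal Pegasos-style sketch in plain Python; the toy data, hyperparameters, and function name are my own choices, not from the notes.

```python
import random

def sgd_hinge_svm(X, y, lam=0.01, epochs=100, seed=0):
    """SGD on lam/2 * ||w||^2 + average hinge loss max(0, 1 - y_i w^T x_i)."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        for _ in range(len(X)):
            t += 1
            i = rng.randrange(len(X))
            eta = 1.0 / (lam * t)  # standard Pegasos step size 1/(lambda * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # gradient of the regularizer shrinks w at every step
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:  # hinge subgradient is nonzero only on margin violations
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

# Toy linearly separable data (made up for illustration)
X = [(2.0, 1.0), (3.0, 2.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0), (-2.0, -2.0)]
y = [1, 1, 1, -1, -1, -1]

w = sgd_hinge_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else -1 for xi in X]
print(preds)  # matches y on this easy data
```

Note how this connects back to the representer form above: since each update adds a multiple of $y_i x_i$, the learned $w$ is exactly a combination $\sum_i e_i \phi(x_i)$ of training points (here with $\phi$ the identity map).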
