# ILLINOIS CS 446 - 101717.2 (7 pages)



- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


## Lecture 15: Stochastic Gradient Descent and Perceptron

*CS446 Machine Learning, Fall 2017. Lecturer: Sanmi Koyejo. Scribe: Qiwen Wang. Oct 17th, 2017.*

### Recap

**Feature mapping.** Let $f$ be a mapping and $\phi$ be a feature function, where $f : x \mapsto y$ and $\phi : x \mapsto \phi(x)$. For example, consider a feature function that separates points inside and outside of a circle. The mapping function then becomes $f(x) = w^T \phi(x)$. Here $f$ is a linear function with respect to $\phi(x)$, but a nonlinear function with respect to $x$ if $\phi$ is nonlinear. The figure on the left of Figure 1 cannot represent the separating boundary with a linear function, while after feature mapping we can represent the separating hyperplane with a linear function.

Figure 1: Upon transforming the figure with the feature function $\phi$, we are able to separate the points with a hyperplane, i.e., a linear function.

**Mercer kernel.** The kernel between $x_i$ and $x_j$ is the inner product between the corresponding feature vectors, meaning

$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

In particular, the kernel function is useful when $\phi(x)$ is infinite-dimensional. Since

$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} k(x, x_i)\,\alpha_i,$$

when $\phi(x)$ is infinite-dimensional or high-dimensional, the kernel approach is more efficient to compute.

**Hilbert space.** The Hilbert space $\mathcal{H}_k$ is a space of bounded linear functions $w^T \phi(x)$. Equivalently, the Hilbert space is the space of bounded linear functions $f(x) = \sum_{i=1}^{n} k(x, x_i)\,\alpha_i$.

### Support Vector Machine

For the fully separable case, the goal is to find

$$\min_w \frac{\|w\|_2^2}{2} \quad \text{such that } y_i f(x_i) \geq 1 \text{ for all } i \in [n].$$

This is the same as finding the $w$ that maximizes the margin $\frac{1}{\|w\|}$.

Why do we set the threshold to be $1$? Note that $f(x) = w^T \phi(x)$. Assume that

$$y_i f(x_i) \geq 10 \quad \Longrightarrow \quad y_i \frac{w^T}{10}\,\phi(x_i) \geq 1.$$

Let $w' = \frac{w}{10}$; our new goal is to minimize $\frac{\|w'\|_2^2}{2}$ such that $y_i f'(x_i) \geq 1$. Therefore it doesn't matter what constant we set for the threshold.

For the non-separable case, the goal is to find

$$\min_w \frac{\|w\|_2^2}{2} + \sum_{i=1}^{n} \xi_i \quad \text{such that } \xi_i \geq 0,\; y_i f(x_i) \geq 1 - \xi_i \text{ for all } i \in [n].$$

We call $\xi_i$ the slack variable that captures the error in the prediction.

Figure 2: The circled data points are support vectors. There is a point with a negative label that fails to be classified as negative; it receives a penalty $\xi_n$.

The solution $w$ can be written in the form

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i),$$

where $\alpha_i \geq 0$ for all $i \in [n]$; $\alpha_i$ is nonzero when the corresponding point is a support vector. When the data is separable and there is no slack, the support vectors are all on the boundary, meaning $\alpha_i \neq 0$ if and only if $y_i f(x_i) = 1$. When the data is non-separable, the support vectors are the $x_i$ s.t. ...
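To make the circle example and the kernel identity concrete, here is a minimal sketch. The feature map $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$ is one common illustrative choice, not necessarily the one used in lecture; with it, the circle becomes a hyperplane in the lifted space, and the kernel $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ can be evaluated without forming $\phi$ explicitly:

```python
import numpy as np

def phi(x):
    """Feature map phi(x) = (x1, x2, x1^2 + x2^2): lifts 2-D points so that
    the circle x1^2 + x2^2 = r^2 becomes a hyperplane in the lifted space."""
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

def kernel(x, z):
    """Mercer kernel k(x, z) = phi(x)^T phi(z), computed without forming phi."""
    return x @ z + (x @ x) * (z @ z)

# In the lifted space, f(x) = w^T phi(x) with w = (0, 0, 1) and threshold 1
# separates points inside the unit circle from points outside it.
inside = np.array([0.3, 0.2])
outside = np.array([1.5, -1.0])
w = np.array([0.0, 0.0, 1.0])

print(w @ phi(inside) < 1.0)   # True: inside the circle
print(w @ phi(outside) > 1.0)  # True: outside the circle

# The kernel shortcut agrees with the explicit inner product of features.
print(np.isclose(kernel(inside, outside), phi(inside) @ phi(outside)))  # True
```

The last check is the point of the Mercer-kernel recap: nothing about $f$ changes if we replace the explicit inner product with the kernel evaluation.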
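The threshold-rescaling argument above can be checked numerically. The data below is hypothetical; it just satisfies the margin constraint with threshold 10, so dividing $w$ by 10 recovers the threshold-1 constraint while the objective rescales by a constant:

```python
import numpy as np

# Hypothetical linear data satisfying y_i * w^T x_i >= 10 for all i.
w = np.array([4.0, -2.0])
X = np.array([[5.0, 0.0], [0.0, -6.0]])
y = np.array([1.0, 1.0])

margins = y * (X @ w)
print(np.all(margins >= 10.0))           # True: constraint with threshold 10

w_prime = w / 10.0                        # rescale exactly as in the argument
print(np.all(y * (X @ w_prime) >= 1.0))   # True: same constraint, threshold 1

# The objective ||w'||^2 / 2 is just ||w||^2 / 2 scaled by 1/100,
# so the two problems share the same minimizer up to rescaling.
print(np.isclose(w_prime @ w_prime / 2, (w @ w) / 200))  # True
```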
