# ILLINOIS CS 446 - 101717.2 (7 pages)


- Pages: 7
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017. Lecture 15: Stochastic Gradient Descent and Perceptron. Lecturer: Sanmi Koyejo. Scribe: Qiwen Wang. Oct 17th, 2017.

## Recap: Feature Mapping

Let $f$ be a function and $\phi$ be a feature mapping, where $f(x) = w^T \phi(x)$. For example, consider a feature function $\phi$ that separates points inside and outside of a circle. The prediction function is then $f(x) = w^T \phi(x)$: $f$ is linear with respect to $\phi(x)$, but nonlinear with respect to $x$ when $\phi$ is nonlinear. The data on the left of Figure 1 cannot be separated by a hyperplane that is linear in $x$, while after the feature mapping the separating hyperplane is linear in $\phi(x)$.

*Figure 1: After transforming the data with the feature function, we are able to separate the points with a hyperplane, i.e., a linear function.*

## Mercer Kernel

The kernel between $x_i$ and $x_j$ is the inner product of the corresponding feature vectors:

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

The kernel function is particularly useful when $\phi(x)$ is infinite-dimensional. Since

$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} \kappa(x, x_i)\,\alpha_i,$$

when $\phi(x)$ is infinite-dimensional (or very high-dimensional), the kernel approach is more efficient to compute.

## Hilbert Space

The Hilbert space $\mathcal{H}_\kappa$ is a space of bounded linear functions $w^T \phi(x)$. Equivalently, it is the space of bounded linear functions of the form $f(x) = \sum_{i=1}^{n} \kappa(x, x_i)\,\alpha_i$.

## Support Vector Machine

For the fully separable case, the goal is to find

$$\min_w \frac{\|w\|_2^2}{2} \quad \text{such that } y_i f(x_i) \ge 1 \text{ for all } i \in [n].$$

This is the same as finding the $w$ that maximizes the margin $1/\|w\|$.

Why do we set the threshold to be 1? Note that $f(x) = w^T \phi(x)$. Suppose instead we require $y_i f(x_i) \ge 10$, i.e., $y_i w^T \phi(x_i) \ge 10$. Dividing by 10 and letting $w' = w/10$, the constraint becomes $y_i (w')^T \phi(x_i) \ge 1$, and the new goal is to minimize $\|w'\|_2^2 / 2$ subject to $y_i f(x_i) \ge 1$. Therefore it does not matter what constant we use as the threshold.

For the non-separable case, the goal is to find

$$\min_w \frac{\|w\|_2^2}{2} + \sum_{i=1}^{n} \xi_i \quad \text{such that } \xi_i \ge 0, \; y_i f(x_i) \ge 1 - \xi_i \text{ for all } i \in [n].$$

We call $\xi_i$ the slack variable; it captures the error in the prediction.

*Figure 2: The circled data points are the support vectors.*
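The circle example from the recap can be made concrete with a short sketch. The specific feature map $\phi(x) = (x_1^2, x_2^2)$ and the weights below are my choices for illustration; the notes only say that $\phi$ separates points inside and outside of a circle.

```python
import numpy as np

# Hypothetical feature map for the circle example: phi(x) = (x1^2, x2^2).
# The circle ||x||^2 = r^2 becomes the line z1 + z2 = r^2 in feature space,
# so f(x) = w^T phi(x) + b is linear in phi(x) but nonlinear in x.
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)  # inside unit circle -> +1

# In feature space, w = (-1, -1) with bias b = 1 separates the classes:
# f(x) = 1 - x1^2 - x2^2 > 0 exactly when ||x|| < 1.
w, b = np.array([-1.0, -1.0]), 1.0
preds = np.sign(np.array([w @ phi(x) + b for x in X]))
print((preds == y).all())
```

In the original space no line separates the classes; in $\phi$-space a single linear function does, which is the point of the recap.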
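The identity $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ can also be checked numerically. The quadratic kernel and its explicit feature map below are a standard illustration, not taken from the notes.

```python
import numpy as np

# For the quadratic kernel k(x, z) = (x^T z)^2 in 2-D, one explicit
# feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), since
# (x1*z1 + x2*z2)^2 = x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k(x, z), phi(x) @ phi(z)))  # kernel = feature inner product
```

Evaluating $k(x, z)$ costs $O(d)$ here, while forming $\phi$ explicitly grows quadratically in $d$, which is why the kernel route wins for high-dimensional (or infinite-dimensional) feature maps.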
*One point with a negative label fails to be classified as negative; it receives a slack penalty.*

The solution $w$ can be written in the form

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i,$$

where $\alpha_i \ge 0$ for all $i \in [n]$, and $\alpha_i$ is non-zero exactly when the corresponding point is a support vector. When the data is separable and there is no slack, the support vectors all lie on the margin boundary, i.e., $\alpha_i \ne 0$ if and only if $y_i f(x_i) = 1$. When the data is non-separable, the support vectors are the $x_i$ such that $y_i f(x_i) = 1 - \xi_i$; support vectors for non-separable data are not necessarily on the boundary.

Why is the support-vector representation of the SVM important? Storing the kernel function requires $O(nd)$ space, while storing the SVM solution only requires $O(d \cdot |SV|)$, where $|SV|$ is the number of support vectors. Thus the SVM greatly reduces storage.

## Representation Theorem

We can write the optimization problem in terms of the Hilbert space $\mathcal{H}_\kappa$:

$$\min_{f \in \mathcal{H}_\kappa} \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \|f\|_{\mathcal{H}_\kappa}^2.$$

We can substitute different loss functions for $\ell$. For kernel ridge regression, $\ell(y_i, f(x_i)) = (y_i - f(x_i))^2$; for the kernel SVM, $\ell(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$.

The representation theorem says that we do not need to optimize over the whole function space: the optimal function has the form

$$f(x) = \sum_{i=1}^{n} \alpha_i \kappa(x, x_i),$$

so we can optimize the $\alpha_i$ directly. The same technique applies to kernel logistic regression, kernel Poisson regression, etc.

## Stochastic Gradient Descent

All the loss functions so far can be written as

$$\ell(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(w).$$

We can solve this problem using gradient descent,

$$\nabla_w \ell(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w \ell_i(w),$$

which averages the gradients over all samples. If, instead of the data in the training set, we consider the population risk $\ell(w) = \mathbb{E}_P[\ell(w)]$, then under weak conditions its gradient is $\nabla \ell(w) = \mathbb{E}_P[\nabla_w \ell(w)]$, so the empirical gradient approximates the population gradient. We then want to replace $\nabla_w \ell(w)$ with an unbiased approximation: we pick an estimator $g$ such that $\mathbb{E}[g] - \mathbb{E}_P[\nabla_w \ell(w)] = 0.$
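The representation theorem makes kernel ridge regression very concrete: substituting $f(x) = \sum_i \alpha_i \kappa(x, x_i)$ into the squared-loss objective turns the problem into a linear system in $\alpha$. A minimal sketch follows; the RBF kernel, the toy data, and the explicit regularization weight `lam` are my choices (the notes write the penalty without a weight).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0])

lam = 1e-3                          # regularization weight (assumed)
K = rbf_kernel(X, X)
# Representer theorem: f(x) = sum_i alpha_i * k(x, x_i).  For the squared
# loss plus the norm penalty, the coefficients solve (K + lam*I) alpha = y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

f_train = K @ alpha                 # fitted values at the training points
print(np.mean((f_train - y) ** 2))  # near-interpolating fit for small lam
```

Note that we never build $\phi(x)$: only the $n \times n$ kernel matrix and the $n$ coefficients $\alpha_i$ appear, exactly as the theorem promises.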
The unbiased estimator used by SGD is $\nabla_w \ell_i(w)$: pick $i$ uniformly from $\{1, \dots, n\}$; the SGD update rule is

$$w_{t+1} = w_t - \eta_t \nabla_w \ell_i(w_t).$$

For example, pick one random sample $(y_i, x_i)$ and let the loss be the square loss $\ell_i(w) = (y_i - w^T x_i)^2$. Taking the derivative gives

$$\nabla_w \ell_i(w) = -2 x_i (y_i - w^T x_i),$$

so the update rule for the square loss is

$$w_{t+1} = w_t + \eta_t \, 2 x_i (y_i - w_t^T x_i).$$

Intuitively, we can see SGD as GD plus noise. Useful properties:

- Convergence: SGD converges if $\ell(w)$ is convex and the step size is chosen appropriately.
- Online learning: it easily handles streaming data.
- Adaptive: it can deal with non-differentiable functions easily.

*Figure 3: Gradient descent. The direction always points towards the optimal point.*

*Figure 4: Stochastic gradient descent. The direction mostly points towards the optimal point, but includes some noise.*

## Perceptron

The perceptron is a linear classifier that uses the linear function $f(x) = w^T x + w_0$. The prediction $h(x)$ is based on the sign of $f(x)$: $h(x) = \operatorname{sign}(f(x))$. The inequality $y_i f(x_i) \le 0$ means a wrong prediction, and we update only when the prediction is wrong:

$$w_{t+1} = \begin{cases} w_t + \eta_t y_i x_i & \text{if } y_i f(x_i) \le 0 \text{ (wrong prediction)} \\ w_t & \text{otherwise.} \end{cases}$$

The loss function for each sample is $\ell_i(w) = \max(0, -y_i f(x_i))$. Applying gradient descent to this loss gives

$$\frac{d}{dw} \max(0, -y_i f(x_i)) = \begin{cases} 0 & y_i f(x_i) > 0 \\ -y_i x_i & y_i f(x_i) < 0 \\ \text{doesn't matter} & y_i f(x_i) = 0. \end{cases}$$

If the data is separable, the perceptron will find a linear separator.
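The square-loss SGD update can be sketched end to end. The synthetic noiseless data, the constant step size, and the iteration count are my choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless linear data (assumed)

w = np.zeros(d)
eta = 0.01                          # constant step size (assumed)
for t in range(5000):
    i = rng.integers(n)             # pick i uniformly from {1, ..., n}
    grad = -2 * X[i] * (y[i] - w @ X[i])  # gradient of (y_i - w^T x_i)^2
    w = w - eta * grad              # w_{t+1} = w_t - eta * grad
print(np.allclose(w, w_true, atol=1e-2))
```

Each step touches a single sample, which is what makes SGD suitable for the streaming and online settings listed above; on noisy data the iterates would hover around the optimum ("GD plus noise") unless the step size is decayed.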
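The perceptron update rule can likewise be sketched on linearly separable data. The synthetic data, the margin filter that guarantees fast convergence, and the step size $\eta = 1$ are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_star = np.array([2.0, -1.0])
scores = X @ w_star + 0.5
keep = np.abs(scores) > 1.0          # enforce a margin so convergence is quick
X, y = X[keep], np.sign(scores[keep])

# Fold the bias w0 into w by appending a constant-1 feature.
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(3)

for epoch in range(200):
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:       # y_i f(x_i) <= 0: wrong prediction
            w = w + yi * xi          # w_{t+1} = w_t + eta * y_i * x_i, eta = 1
            mistakes += 1
    if mistakes == 0:                # one full clean pass: converged
        break

print(all(np.sign(w @ xi) == yi for xi, yi in zip(Xb, y)))
```

Because the data is separable with a positive margin, the number of updates is bounded, matching the closing claim that the perceptron finds a linear separator on separable data.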
