# ILLINOIS CS 446 - 101717.3 (11 pages)


- Pages: 11
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS446 Machine Learning, Fall 2017. Lecture 15: Kernels, Stochastic Gradient Descent, Perceptron. Lecturer: Sanmi Koyejo. Scribe: Webber Lee, Oct 17th, 2017.

## Agenda

- Recap: SVM and hinge loss, representer theorem
- Stochastic gradient descent
- Perceptron and other examples
- Multilayer perceptrons (neural networks): representation

## Recap

We have inputs $x \in \mathcal{X}$ and we are interested in a function $f: \mathcal{X} \to \mathcal{Y}$ of the form

$$f(x) = w^T \phi(x) + w_0,$$

where $\phi(x)$ is our feature function. This is a linear function w.r.t. $\phi(x)$, but a non-linear function w.r.t. $x$ if $\phi(x)$ is non-linear.

Figure 1 shows an example of the kernel trick. If we map the separating hyperplane in the three-dimensional space back to the original two-dimensional space, we get a circle, which is something we cannot represent with a linear function in the original space.

*Figure 1: Data lifted to a higher dimension becomes linearly separable (source: Kim, 2013).*

## Kernels

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

This is useful because many of our favorite algorithms can be written in terms of inner products alone, so all we need in order to represent a linear function is the inner product. This is especially useful when $\phi(x)$ is infinite-dimensional: in that case we would need to learn an infinite number of parameters for $w$. In particular, we can rewrite $f(x) = w^T \phi(x)$ as

$$f(x) = \sum_{i=1}^{n} \alpha_i \kappa(x, x_i).$$

We showed this for:

- linear regression, using a linear algebra identity;
- support vector machines, using the form of the solution (Lagrange dual).

### Hilbert spaces

$\mathcal{H}_\kappa$ is the space of well-behaved linear functions $w^T \phi(x)$: bounded functions of the form $f(x) = \sum_i \alpha_i \kappa(x, x_i)$.

## Support Vector Machines

The problem we end up choosing to solve is

$$\min_w \frac{1}{2}\|w\|_2^2 \quad \text{subject to } y_i f(x_i) \ge 1 \;\; \forall i \in [n],$$

where $f(x) = w^T \phi(x)$. The hinge loss is shown in Figure 2.

*Figure 2: Hinge loss for SVM, $\max(0, 1 - y f(x))$.*

This is equivalent to maximizing the margin, where margin $= \frac{1}{\|w\|}$.

Why the constant 1 in the constraint? Because it does not matter: it is an arbitrary choice, and any scaling can be absorbed by $w$. For example,

$$y_i f(x_i) \ge 10 \iff \frac{1}{10}\, y_i f(x_i) \ge 1 \iff y_i \left(\frac{w}{10}\right)^T \phi(x_i) \ge 1.$$

Let $w' = w/10$. We have an equivalent optimization problem

$$\min_{w'} \frac{1}{2}\|w'\|_2^2 \quad \text{subject to } y_i f(x_i) \ge 1.$$

### What if the data is non-separable?

In that case we solve

$$\min_{w,\, \xi} \;\frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{n} \xi_i \quad \text{subject to } \xi_i \ge 0, \;\; y_i f(x_i) \ge 1 - \xi_i \;\; \forall i \in [n].$$

Here $\xi_i$ captures the error we make for each prediction, as shown in Figure 3, so we are also minimizing the sum of the errors we make.

*Figure 3: An example of SVM with error $\xi_n$ for point $x_n$, with $f(x) = w^T \phi(x)$.*

The solution to this problem is

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i),$$

and we know $\alpha_i \ge 0$: it is only non-zero for points that are support vectors.

Memory cost:

- storage for the kernel function: $O(nd)$
- storage for the SVM: $O(d \cdot \#\mathrm{sv})$

## Representer Theorem

$$\min_{f \in \mathcal{H}_\kappa} \sum_{i=1}^{n} l(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}_\kappa}$$

- Kernel ridge regression: $l(y_i, f(x_i)) = (y_i - f(x_i))^2$, using a matrix identity.
- Kernel SVM: $l(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$, using the dual representation (Lagrange multipliers).

In both cases the solution has the form $f(x) = \sum_{i=1}^{n} \alpha_i \kappa(x, x_i)$. Other examples: kernel logistic regression, kernel Poisson regression.

## Stochastic Gradient Descent (SGD)

Recall that all loss functions we have seen so far have the form

$$l(w) = \frac{1}{n} \sum_{i=1}^{n} l_i(w).$$

Examples of $l_i(w)$:

- $(y_i - w^T x_i - w_0)^2$
- $\max(0, 1 - y_i (w^T x_i + w_0))$

Then the gradient descent (GD) step for such a loss function uses

$$\nabla_w l(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w l_i(w).$$

### Population / empirical risk

We can extend the above idea a little bit. Instead of looking at the training data, we look at the population risk, which can be written as

$$l(w) = \mathbb{E}_P[l(w)].$$

Under some weak conditions (to be discussed), we can do gradient descent on the population risk as well:

$$\nabla l(w) = \mathbb{E}_P[\nabla_w l(w)].$$

Can we replace $\mathbb{E}_P[\nabla_w l(w)]$ with something else? It turns out that, if we work out the math, the replacement must be unbiased. In other words, we want to find an approximation $\Delta$ such that

$$\mathrm{Bias} = \mathbb{E}[\Delta] - \mathbb{E}_P[\nabla_w l(w)] = 0,$$

where $\Delta$ is our choice of approximation and $\mathbb{E}_P[\nabla_w l(w)]$ is our target.

Example: given $x \sim P$, say we are interested in the mean $\mathbb{E}[x]$. One simple unbiased estimator of the mean is

$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i] = \mathbb{E}[x].$$

An even simpler unbiased estimator is to randomly pick a single $x_i$:

$$\mathbb{E}[x_i] = \mathbb{E}[x].$$

Now it can be shown that an unbiased estimator of $\mathbb{E}_P[\nabla_w l(w)]$ is $\nabla_w l_i(w)$. More specifically, we pick $i$ uniformly from $[n]$, i.e. pick one sample $(x_i, y_i)$, and the update rule is

$$w_{t+1} = w_t - \eta_t \nabla_w l_i(w_t).$$

Example:

$$l_i(w) = (y_i - w^T x_i)^2, \qquad \nabla_w l_i(w) = -2 x_i (y_i - w^T x_i),$$

$$w_{t+1} = w_t + 2 \eta_t\, x_i (y_i - w_t^T x_i).$$

### Visualization of GD and SGD

Illustrations of gradient descent and stochastic gradient descent are shown in Figures 4 and 5 respectively. GD follows the gradient at every step. In contrast, SGD follows only one sample of the gradient, so the path may look like a random walk. However, we can show that for a convex function, SGD will end up in the proximity of the answer.

*Figure 4: A visualization of gradient descent (source: NASACJ, 2016).*

*Figure 5: A visualization of stochastic gradient descent (source: Chavan, 2017).*

### Some useful properties

- Converges if $l(w)$ is convex and the step size is chosen appropriately.
- Easily deals with streaming data or online learning.
- Adaptive. An illustration is shown in Figure 6. Suppose that at $t = 101$ the distribution of $(x_i, y_i)$ shifts from $P_1$ to $P_2$, i.e. $(x_i, y_i) \sim P_1$ for $t = 0, \dots, 100$ and $(x_i, y_i) \sim P_2$ for $t > 100$. As we get more and more points from $P_2$ over time, the decision boundary will shift accordingly.

*Figure 6: An illustration of the adaptiveness of SGD.*

- Deals with non-differentiable functions easily. We try to explain this concept in an informal, intuitive way with the example shown in Figure 7. GD will terminate once it reaches a point where the gradient is 0, even though that point is not optimal. In contrast, SGD can be thought of as a noisy version of GD, i.e. SGD = GD + noise, therefore $\nabla_w l_i(w) = \nabla_w l(w) + \text{noise}$. With the additional noise term, it is less likely for the gradient to become 0, so SGD can continue to seek the minimum. …
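The kernel identity from the recap, $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, can be checked numerically. A minimal sketch using the degree-2 polynomial feature map (an illustrative choice of $\phi$, not one specified in the notes):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def kernel(x, z):
    # The matching kernel: kappa(x, z) = (x^T z)^2,
    # computed without ever forming phi(x).
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The inner product in feature space equals the kernel in input space.
lhs = np.dot(phi(x), phi(z))   # phi(x)^T phi(z)
rhs = kernel(x, z)             # (x^T z)^2
assert np.isclose(lhs, rhs)
```

This is the point of the trick: the kernel evaluation costs $O(d)$ regardless of how large (or infinite) the feature space of $\phi$ is.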
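The hinge loss $\max(0, 1 - y_i w^T x_i)$ is exactly the kind of non-differentiable function the last property refers to: it has a kink at margin 1, but a subgradient is available everywhere, so an SGD step is still well defined. A small sketch (the data point is made up for illustration):

```python
import numpy as np

def hinge_loss(w, x, y):
    # l_i(w) = max(0, 1 - y_i * w^T x_i)
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    # A subgradient of l_i at w: -y_i * x_i when the margin
    # constraint y_i w^T x_i >= 1 is violated, 0 otherwise.
    if y * np.dot(w, x) < 1.0:
        return -y * x
    return np.zeros_like(w)

w = np.zeros(2)
x = np.array([2.0, 1.0])
y = 1.0

before = hinge_loss(w, x, y)      # margin is 0 < 1, so loss is 1.0
g = hinge_subgradient(w, x, y)    # -y*x = [-2, -1]
w = w - 0.1 * g                   # one SGD step: w becomes [0.2, 0.1]
after = hinge_loss(w, x, y)       # w^T x = 0.5, so loss drops to 0.5
```

One such step per randomly drawn $(x_i, y_i)$ is SGD on the (non-separable) SVM objective.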
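The squared-loss SGD update derived above, $w_{t+1} = w_t + 2\eta_t\, x_i (y_i - w_t^T x_i)$, can be run end to end. A sketch on synthetic noiseless linear data (the data, step size, and iteration count are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y_i = w_true^T x_i (no noise).
n, d = 200, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true

# SGD on l_i(w) = (y_i - w^T x_i)^2:
#   grad l_i(w) = -2 x_i (y_i - w^T x_i)
#   w_{t+1}     = w_t + 2 * eta * x_i * (y_i - w_t^T x_i)
w = np.zeros(d)
eta = 0.01
for t in range(5000):
    i = rng.integers(n)                        # pick one sample uniformly
    w = w + 2 * eta * X[i] * (y[i] - X[i] @ w)

# w should now be close to w_true, though the path there is noisy.
```

The per-step cost is $O(d)$ instead of the $O(nd)$ of a full-gradient step, which is the practical appeal of SGD.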
