# ILLINOIS CS 446 - 101717.3 (11 pages)



- Pages: 11
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning


**CS 446 Machine Learning, Fall 2017. Lecture 15: Kernels, Stochastic Gradient Descent, Perceptron.**
Lecturer: Sanmi Koyejo. Scribe: Webber Lee. Oct 17th, 2017.

## Agenda

- Recap: SVM and hinge loss; representer theorem
- Stochastic gradient descent
- Perceptron and other examples
- Multilayer perceptrons (neural networks): representation

## Recap

We have inputs $x \in \mathcal{X}$ and we are interested in a function $f: \mathcal{X} \to \mathcal{Y}$ of the form

$$f(x) = w^T \phi(x) + w_0,$$

where $\phi(x)$ is our feature function: $f$ is a linear function with respect to $\phi(x)$, but a non-linear function with respect to $x$ if $\phi$ is non-linear.

Figure 1 shows an example of the kernel trick. If we map the separating hyperplane in the three-dimensional space back to the original two-dimensional space, we get a circle, which is something we cannot represent with a linear function in the original space.

*Figure 1: Data lifted to a higher dimension becomes linearly separable (source: Kim 2013).*

## Kernels

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

This is useful because many of our favorite algorithms can be written purely in terms of inner products, so the inner product is all we need to represent a linear function. This is especially helpful when $\phi(x)$ is infinite-dimensional: in that case we would otherwise need to learn an infinite number of parameters for $w$. In particular, we can rewrite $f(x) = w^T \phi(x)$ as

$$f(x) = \sum_{i=1}^{n} \alpha_i \,\kappa(x, x_i).$$

We show this holds for:

- linear regression, using a linear algebra identity;
- the support vector machine, using the form of the solution (the Lagrange dual).

## Hilbert Spaces

$\mathcal{H}_\kappa$ is the space of well-behaved linear functions $w^T \phi(x)$ on $\mathcal{X}$: bounded functions of the form

$$f(x) = \sum_{i} \alpha_i \,\kappa(x, x_i).$$

## Support Vector Machines

The problem we end up choosing to solve is

$$\min_w \; \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i f(x_i) \ge 1, \;\; i \in [n], \qquad f(x) = w^T \phi(x).$$

The hinge loss, $\max(0,\, 1 - y f(x))$, is shown in Figure 2.

*Figure 2: Hinge loss for SVM.*

This is equivalent to maximizing the margin, where $\text{margin} = \frac{1}{\|w\|}$.

Why the constant 1 in the constraint? Because it does not matter: it is an arbitrary choice, since any scaling can be absorbed by $w$. For example,

$$y_i f(x_i) \ge 10 \;\iff\; \frac{1}{10}\, y_i f(x_i) \ge 1 \;\iff\; \frac{1}{10}\, y_i w^T x_i \ge 1 \;\iff\; y_i \left(\frac{w}{10}\right)^{\!T} x_i \ge 1.$$

Let $w' = w/10$. We have an equivalent optimization problem

$$\min_{w'} \; \frac{1}{2}\|w'\|_2^2 \quad \text{subject to} \quad y_i f(x_i) \ge 1.$$

What if the data is non-separable? In that case we solve

$$\min_{w,\,\xi} \; \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad \xi_i \ge 0, \;\; y_i f(x_i) \ge 1 - \xi_i, \;\; i \in [n].$$

Here $\xi_i$ captures the error we make on the prediction for $x_i$, as shown in Figure 3, so we are also minimizing the sum of the errors we make.

*Figure 3: An example of SVM with error, $f(x) = w^T \phi(x)$.*

The solution to this problem is

$$w = \sum_{i=1}^{n} \alpha_i\, y_i\, \phi(x_i).$$
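The kernel-trick picture in Figure 1 can be checked numerically. The sketch below is not from the lecture: it uses a hand-written feature map `phi` for the degree-2 polynomial kernel in 2D to verify that $\kappa(x, z) = (x^T z)^2$ equals the inner product of explicitly lifted features, which is why a hyperplane in the lifted 3-D space corresponds to a circle in the original 2-D space.

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel in 2D:
#   phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2),
# so that phi(x) . phi(z) = (x . z)^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lifted = np.dot(phi(x), phi(z))   # inner product after lifting to 3-D
direct = poly_kernel(x, z)        # kernel evaluated in the original 2-D space

print(lifted, direct)             # both are (1*3 + 2*(-1))^2, i.e. approximately 1.0
```

The point of the kernel trick is the right-hand computation: we never need to materialize $\phi(x)$, which matters when the lifted space is very high- or infinite-dimensional.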

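The representer-theorem form $f(x) = \sum_i \alpha_i \kappa(x, x_i)$ can be illustrated with ridge regression, the "linear algebra identity" case the notes mention. The following is a minimal sketch with made-up data and a linear kernel (none of it from the lecture), showing that the dual (kernel) prediction agrees with the primal weight-space prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Kernel ridge regression: by the representer theorem the solution lies in
# the span of the training points, f(x) = sum_i alpha_i * k(x, x_i), with
#   alpha = (K + lam*I)^{-1} y.
# With a linear kernel this must match ordinary ridge regression in weight space.
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

lam = 0.1
K = X @ X.T                                    # Gram matrix, K_ij = x_i . x_j
alpha = np.linalg.solve(K + lam * np.eye(20), y)

# Primal ridge solution for comparison: w = (X^T X + lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

x_new = rng.normal(size=3)
f_kernel = np.sum(alpha * (X @ x_new))         # representer-theorem prediction
f_primal = w @ x_new                           # weight-space prediction
print(abs(f_kernel - f_primal) < 1e-8)         # the two forms agree
```

The dual form only ever touches the data through inner products ($K$ and `X @ x_new`), so swapping in any other kernel $\kappa$ gives a non-linear regressor for free.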
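The agenda items on stochastic gradient descent and the perceptron connect through the hinge loss: SGD on the (unregularized) hinge loss $\max(0, 1 - y\, w^T x)$ takes a step $\eta\, y_i x_i$ whenever $y_i w^T x_i < 1$, and replacing the threshold 1 with 0 recovers exactly the perceptron update. A small sketch with invented toy data, not the lecture's code:

```python
import numpy as np

# Stochastic (sub)gradient descent on the hinge loss max(0, 1 - y * w.x):
# if the margin y * w.x is below 1, step in the direction y*x; otherwise
# the subgradient is zero and we do nothing. With threshold 0 instead of 1
# this is exactly the perceptron update.
def hinge_sgd(X, y, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (w @ X[i]) < 1:        # inside the margin or misclassified
                w += lr * y[i] * X[i]        # subgradient step on the hinge loss
    return w

# Linearly separable toy data (labels +/- 1), with a constant bias feature appended.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w = hinge_sgd(X, y)
preds = np.sign(X @ w)
print(preds)  # matches y on this separable toy set
```

Note this sketch omits the $\frac{1}{2}\|w\|_2^2$ regularizer from the SVM objective; including its gradient $w$ in each step would give SGD on the full soft-margin objective.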