# ILLINOIS CS 446 - 101717.1



- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


## CS446 Machine Learning, Fall 2017

### Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron

Lecturer: Sanmi Koyejo. Scribe: Ke Wang. Oct 24th, 2017.

### Agenda

- Recap: SVM and hinge loss; representer theorem
- Stochastic gradient descent
- Perceptron, other examples

### Kernels

Recall the feature transformation. For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the prediction function is $f: \mathcal{X} \to \mathcal{Y}$. Let $\phi$ be the feature transformation function; then we have a mapping $x \mapsto \phi(x)$. After the transformation,

$$f(x) = w^T \phi(x) + w_0.$$

Note that $f(x)$ here is a linear function with respect to $\phi(x)$, and a non-linear function with respect to $x$, under the condition that $\phi(x)$ is a non-linear function.

Define the kernel function as

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

This is especially useful when $\phi(x)$ is infinite-dimensional. The motivation for infinite-dimensional $\phi(x)$ is that, in theory, you can almost always make your data points linearly separable by bending the data space into an infinite-dimensional space. See Figure 1 (data lifted to a higher dimension becomes linearly separable; from Kim, 2013).

When $\phi(x)$ has infinitely many dimensions, solving for $w$ in $w^T \phi(x)$ is hard, since $w$ is also infinite-dimensional. But since $f(x)$ usually appears through inner products of the form $\phi(x)^T \phi(x')$, we can use the kernel function to represent $f(x)$ without knowing $\phi$ or $w$. For a given data point,

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$

We have shown two applications of kernels:

- Linear regression, using a linear algebra identity
- SVM, using the method of Lagrange multipliers

### Hilbert Spaces

$\mathcal{H}_K$ is the space of well-behaved linear functions like $w^T \phi(x)$, and this is equivalent to the space of bounded functions

$$f(x) = \sum_{i} \alpha_i K(x, x_i), \qquad K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

### Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to $1 / \|w\|_2$:

$$\min \|w\|_2^2 \quad \text{s.t.} \quad y_i f(x_i) \ge 1 \;\; \forall i \in [n], \qquad \text{where } f(x) = w^T \phi(x).$$

Why set the threshold to 1? Suppose we change it to some constant $C$:

$$y_i f(x_i) \ge C \iff y_i \frac{f(x_i)}{C} \ge 1 \iff y_i \left(\frac{w}{C}\right)^T \phi(x_i) \ge 1.$$

This only rescales $w$, so it won't change the solution to our optimization problem

$$\min \|w\|_2^2 \quad \text{s.t.} \quad y_i f(x_i) \ge 1 \;\; \forall i \in [n].$$

So we can just set $C = 1$ without losing generality.
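The kernel representation above, $f(x) = \sum_i \alpha_i K(x, x_i)$, can be sketched in a few lines of NumPy. This uses the Gaussian (RBF) kernel as an example of a kernel whose feature map is infinite-dimensional; the names `rbf_kernel` and `kernel_predict` and the coefficient values are our own illustration, not from the lecture.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2),
    whose implicit feature map phi is infinite-dimensional."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_predict(x, X_train, alpha, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) without ever forming phi(x)."""
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

# Example: evaluate f at a point given some (hypothetical) trained alphas.
X_train = [[0.0, 0.0], [1.0, 1.0]]
alpha = [1.0, -1.0]
fx = kernel_predict([0.5, 0.5], X_train, alpha)
```

Note that `kernel_predict` touches only inner products through `K`, which is the point of the kernel trick: the dimensionality of $\phi(x)$ never appears in the code.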
### Non-spreadable Data

Recall that with the slack variables $\xi_i$ we have

$$\min \|w\|_2^2 + \lambda \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \xi_i \ge 0, \;\; y_i f(x_i) \ge 1 - \xi_i, \qquad \text{margin} = \frac{1}{\|w\|}.$$

See Figure 2 (SVM with slack variables visualized; data points with circles are on the margin). Intuitively, $\xi_i$ represents how much tolerance for misclassification the model has.

$$f(x) = w^T \phi(x), \qquad w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i) \;\; (\alpha_i \ge 0) \;=\; \sum_{i=1}^{n} e_i \phi(x_i), \quad \text{where } e_i = \alpha_i y_i.$$

When the data is separable (no slack), we have $\alpha_i \ne 0$ iff $y_i f(x_i) = 1$. For the non-separable case, $\alpha_i > 0$ iff $y_i f(x_i) = 1 - \xi_i$; otherwise $\alpha_i = 0$.

The prediction function can be written in the form

$$f(x) = \sum_{i} \alpha_i K(x, x_i).$$

Note that storage for the kernel function is $O(nd)$, whereas storage for the SVM is $O(Sd)$, where $S$ is the number of support vectors. This is because we can ignore the entries $K(x, x_i)$ for which $x_i$ is not a support vector.

### Representer Theorem

The optimization problem in the Hilbert space is

$$\min_{f \in \mathcal{H}_K} \sum_{i=1}^{n} l(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_K}^2,$$

where $l(y_i, f(x_i))$ is the loss function. The representer theorem says the solution of this can be written as

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$

We have shown that:

- In kernel ridge regression (using a matrix identity), $l(y_i, f(x_i)) = (y_i - f(x_i))^2$.
- In kernel SVM (using the dual representation via Lagrange multipliers), $l(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$.

This also applies to other kernel algorithms, for example kernel logistic regression and kernel Poisson regression.

### Stochastic Gradient Descent

All the loss functions we have seen so far are average losses over the data points:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} L_i(w).$$

In regression, $L_i(w) = (y_i - w^T x_i - w_0)^2$. In the SVM, $L_i(w) = \max(0, 1 - y_i (w^T x_i + w_0))$.

The gradient is

$$\nabla_w L(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w L_i(w).$$

The population (rather than empirical) risk is $\bar{L}(w) = \mathbb{E}_P[L(w)]$, and under weak conditions

$$\nabla \bar{L}(w) = \mathbb{E}_P[\nabla_w L(w)].$$

We can replace $\mathbb{E}_P[\nabla_w L(w)]$ with an unbiased value, meaning we pick $\hat{a}$ such that

$$\text{bias} = \mathbb{E}[\hat{a}] - \mathbb{E}_P[\nabla_w L(w)] = 0.$$

For example, we can choose the sample mean: let $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$; then $\mathbb{E}[\bar{X}] = \mathbb{E}[X]$.
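The unbiasedness of the per-example gradient can be checked numerically: averaging $\nabla_w L_i(w)$ over all $i$ recovers the full gradient of the average loss exactly, which is why a uniformly random $i$ gives an unbiased estimate. A minimal NumPy sketch for the squared loss $L_i(w) = (y_i - w^T x_i)^2$; the synthetic data and the name `grad_i` are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def grad_i(w, i):
    """Per-example gradient of L_i(w) = (y_i - w^T x_i)^2."""
    return -2.0 * X[i] * (y[i] - X[i] @ w)

# Averaging the single-example gradients over all i equals the full
# gradient of L(w) = (1/n) sum_i L_i(w), so a uniformly random i is
# an unbiased estimator of it.
full_grad = np.mean([grad_i(w, i) for i in range(n)], axis=0)
direct = -2.0 * X.T @ (y - X @ w) / n
assert np.allclose(full_grad, direct)
```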
We can also choose a single example $x_i$ from the data points, because $\mathbb{E}[x_i] = \mathbb{E}[X]$. So we can choose $\nabla_w L_i(w)$ as an unbiased estimator of $\mathbb{E}[\nabla_w L(w)]$.

The stochastic gradient descent algorithm is then defined as follows. In the $t$-th iteration:

1. Pick $i$ uniformly from $[n]$.
2. Update the parameter: $w_{t+1} = w_t - \eta_t \nabla_w L_i(w_t)$.
3. Go to the next iteration until a stopping condition is hit; for example, we can limit the number of iterations.

That is, in each iteration we randomly pick a data point $(y_i, x_i)$, use it to calculate $L_i(w)$ and $\nabla L_i(w)$, and use this randomly selected gradient to update $w$. For example, in the linear regression case:

$$L_i(w) = (y_i - w^T x_i)^2, \qquad \nabla L_i(w) = -2 x_i (y_i - w^T x_i), \qquad w_{t+1} = w_t + \eta_t \, 2 x_i (y_i - w_t^T x_i).$$

Some useful properties of stochastic gradient descent:

- It will converge eventually if $L(w)$ is convex and the step size is chosen appropriately; see Figures 3 and 4.
- It easily deals with streaming data, for online learning.
- It is adaptive; see Figure 5.
- It deals with non-differentiable functions easily; see Figure 6.

Figure 3: behavior of gradient descent (from Wikipedia, the free encyclopedia, 2007).

The intuition is SGD = GD + noise:

$$\nabla_w L_i(w) = \nabla_w L(w) + \text{noise}.$$

Figure 4: behavior of stochastic gradient descent.

Figure 5: the SGD model shifts as it gets new data (new data are the bigger points).

### Perceptron

The perceptron is an algorithm used for linear classification. If the data is separable, the perceptron will find a linear separator.

Figure 6: GD gets stuck at the flat point, but SGD won't.

Consider the linear model

$$f(x) = w^T x + w_0, \qquad h(x) = \operatorname{sign}(f(x)).$$

The loss function is

$$L_i(w) = \max(0, -y f(x)).$$

Then we have the gradient

$$\nabla_w L_i(w) = \begin{cases} 0 & y_i f(x_i) > 0 \\ -y_i x_i & y_i f(x_i) < 0 \\ 0 & y_i f(x_i) = 0 \end{cases}$$

The perceptron algorithm is defined as: For …
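The notes break off mid-definition, but the update they set up, SGD on $L_i(w) = \max(0, -y_i w^T x_i)$ with step size 1, can be sketched as follows. The function name `perceptron`, the epoch limit, and the toy data are our own illustration under those assumptions.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """SGD on L_i(w) = max(0, -y_i w^T x_i) with step size 1: whenever a
    point is misclassified (y_i w^T x_i <= 0), step by +y_i x_i.
    The bias w_0 is absorbed by appending a constant-1 feature."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified: gradient is -y_i x_i
                w += yi * xi         # w <- w - eta * grad, with eta = 1
                mistakes += 1
        if mistakes == 0:            # separable data: a separator was found
            break
    return w

# Linearly separable toy data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
preds = np.sign(np.hstack([X, np.ones((4, 1))]) @ w)
```

On separable data the inner loop eventually makes a full pass with no mistakes, matching the claim above that the perceptron finds a linear separator.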
