CS446: Machine Learning, Fall 2017

Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron

Lecturer: Sanmi Koyejo                                Scribe: Ke Wang, Oct. 24th, 2017

Agenda

• Recap: SVM and hinge loss, Representer theorem
• Stochastic Gradient Descent
• Perceptron, other examples

Kernels

Recall the feature transformation. For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the prediction function is
$$f : \mathcal{X} \to \mathcal{Y}.$$
Let $\phi$ be the feature transformation function, so we have a mapping $x \to \phi(x)$. After the transformation we have
$$f(x) = w^T \phi(x) + w_0.$$
Note that $f(x)$ is a linear function of $\phi(x)$ and a non-linear function of $x$ (under the condition that $\phi$ is a non-linear function).

Define the kernel function as
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$
This is especially useful when $\phi(x)$ is infinite-dimensional. The motivation for an infinite-dimensional $\phi(x)$ is that, in theory, you can almost always make your data points linearly separable by lifting the data into a high- (even infinite-) dimensional space. See Figure 1.

Figure 1: Data lifted to a higher dimension becomes linearly separable, from Kim (2013).

When $\phi(x)$ is infinite-dimensional, solving for $w$ in $w^T \phi(x)$ is hard, since $w$ is also infinite-dimensional. But since the value of $f(x)$ usually appears only through inner products of the form $\phi(x)^T \phi(x')$, we can use the kernel function to represent $f(x)$ without knowing $\phi(x)$ or $w$ explicitly. For a given data point,
$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$

We have shown two applications of kernels:

• Linear regression
  – using a linear algebra identity
• SVM
  – using the method of Lagrange multipliers

Hilbert Spaces

$\mathcal{H}_k$ is the space of well-behaved linear functions of the form $w^T \phi(x)$. Equivalently, it is the space of bounded-norm functions
$$f(x) = \sum_{i=1}^{\infty} \alpha_i K(x, x_i), \qquad K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$
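To make the kernel representation concrete, here is a minimal Python sketch (not from the lecture) of prediction in the form $f(x) = \sum_i \alpha_i K(x, x_i)$. The RBF kernel is one standard choice of $K$ whose feature map $\phi$ is infinite-dimensional; the function names, the value of `gamma`, and the coefficients below are illustrative assumptions, since in practice the $\alpha_i$ come from training (e.g., kernel ridge regression or the SVM dual).

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2): one common (illustrative)
    # kernel with an infinite-dimensional feature map phi.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def kernel_predict(x, X_train, alpha, gamma=1.0):
    # f(x) = sum_i alpha_i * K(x, x_i): prediction needs only kernel
    # evaluations, never w or phi(x) explicitly.
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

# Toy usage with made-up coefficients alpha.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([0.5, -0.2, 0.7])
print(kernel_predict(np.array([0.5, 0.5]), X_train, alpha))
```

Note that evaluating $f$ this way requires keeping the training points around, which is exactly the $O(nd)$ versus $O(Sd)$ storage distinction discussed below.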
Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to $1/\|w\|$:
$$\min_w \frac{\|w\|_2^2}{2} \quad \text{s.t.} \quad y_i f(x_i) \ge 1, \ \forall i \in [n],$$
where $f(x) = w^T x$.

Why set the threshold to 1? Suppose we change it to some constant $C$:
$$y_i f(x_i) \ge C \iff \frac{y_i f(x_i)}{C} \ge 1 \iff y_i \left(\frac{w}{C}\right)^T x_i \ge 1.$$
This rescaling does not change the solution of our optimization problem, so we can set $C = 1$ without losing generality.

Non-separable Data

Recall that with the slack variables $\xi_i$ we have:
$$\min_{w, \xi} \|w\|_2^2 + \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \xi_i \ge 0, \ y_i f(x_i) \ge 1 - \xi_i.$$
Intuitively, $\xi_i$ represents how much tolerance to misclassification the model has. See Figure 2.

Figure 2: SVM with slack variables (the circled data points lie on the margin).

With $f(x) = w^T \phi(x)$, the solution has the form
$$w^* = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i), \ \alpha_i \ge 0, \qquad \text{equivalently} \qquad w^* = \sum_{i=1}^{n} \tilde{\alpha}_i \phi(x_i), \ \tilde{\alpha}_i = \alpha_i y_i.$$
When the data is separable (no slack), $\alpha_i \ne 0$ iff $y_i f(x_i) = 1$. For the non-separable case, $\alpha_i > 0$ iff $y_i f(x_i) = 1 - \xi_i$, and otherwise $\alpha_i = 0$.

The prediction function can be written in the form
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$
Note that storage for the generic kernel predictor is $O(nd)$, whereas storage for the SVM is $O(Sd)$, where $S$ is the number of support vectors. This is because we can ignore the terms $K(x, x_i)$ for which $x_i$ is not a support vector.

Representer Theorem

The optimization problem in the Hilbert space is
$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^{n} l(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2,$$
where $l(y_i, f(x_i))$ is the loss function. The representer theorem says that the solution can be written as
$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$
We have shown this for:

• Kernel ridge regression, using a matrix identity, with $l(y_i, f(x_i)) = (y_i - f(x_i))^2$
• Kernel SVM, using the dual representation (Lagrange multipliers), with $l(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$

This also applies to other kernel algorithms, for example:

• Kernel logistic regression
• Kernel Poisson regression

Stochastic Gradient Descent

All the loss functions we have seen so far are averages over the data points:
$$L(w) = \frac{1}{n} \sum_{i=1}^{n} L_i(w).$$
In regression we have $L_i(w) = (y_i - w^T x_i - w_0)^2$; in the SVM we have $L_i(w) = \max(0, 1 - y_i(w^T x_i + w_0))$.

The gradient is defined as
$$\nabla_w L(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w L_i(w).$$
This empirical risk approximates the population risk $\bar{L}(w) = \mathbb{E}_P[L_i(w)]$, and under weak conditions
$$\nabla \bar{L}(w) = \mathbb{E}_P[\nabla_w L_i(w)].$$
We can replace $\mathbb{E}_P[\nabla_w L_i(w)]$ by any unbiased estimate, meaning we can pick a $\beta$ such that
$$\text{bias} = \mathbb{E}[\beta] - \mathbb{E}_P[\nabla_w L_i(w)] = 0.$$
For example, the sample mean is unbiased: for $X \sim P$ with $\mu = \mathbb{E}[X]$,
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \mathbb{E}[X].$$
But we can also choose a single example $x_i$ from the data points, because $\mathbb{E}[x_i] = \mathbb{E}[X]$. So $\nabla_w L_i(w)$, for a randomly chosen $i$, is an unbiased estimator of $\mathbb{E}_P[\nabla_w L_i(w)]$.

The stochastic gradient descent (SGD) algorithm is then defined as follows. In the $t$-th iteration:

• Pick $i$ uniformly from $[n]$.
• Update the parameter: $w_{t+1} = w_t - \eta_t \nabla_w L_i(w_t)$.
• Go to the next iteration until a stopping condition is hit; for example, we can limit the number of iterations.

That is, in each iteration we randomly pick a data point $(y_i, x_i)$, use it to compute $L_i(w)$ and $\nabla L_i(w)$, and then use this "randomly selected" gradient to update $w$. For example, in the linear regression case:
$$L_i(w) = (y_i - w^T x_i)^2, \qquad \nabla L_i(w) = -2 x_i (y_i - w^T x_i),$$
$$w_{t+1} = w_t - \eta_t \nabla L_i(w_t) = w_t + 2 \eta_t x_i (y_i - w_t^T x_i).$$
(A code sketch of this update appears at the end of this section.)

Some useful properties of stochastic gradient descent:

• It will converge eventually, if $L(w)$ is convex and the step size is chosen appropriately. See Figures 3 and 4.
• It easily deals with streaming data for online learning.
• It is adaptive. See Figure 5.
• It deals with non-differentiable functions easily. See Figure 6.

Figure 3: Behavior of gradient descent, from Wikipedia, the free encyclopedia (2007).

The intuition is:
$$\text{SGD} \approx \text{GD} + \text{noise}, \qquad \nabla_w L_i(w) \approx \nabla_w L(w) + \text{noise}.$$

Figure 4: Behavior of stochastic gradient descent.

Figure 5: The SGD model shifts as new data arrives (the new data points are drawn larger).
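Here is the promised minimal sketch of SGD on the squared loss, matching the $L_i(w) = (y_i - w^T x_i)^2$ example above (intercept omitted, as in those equations). The constant step size, iteration count, and synthetic data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.05, n_iters=2000, seed=0):
    """SGD on L(w) = (1/n) * sum_i (y_i - w^T x_i)^2.
    Each step picks one i uniformly at random, so grad L_i is an unbiased
    estimate of the full gradient, and takes the descent step
    w_{t+1} = w_t - eta * grad L_i(w_t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)                       # pick i uniformly from [n]
        grad_i = -2.0 * X[i] * (y[i] - X[i] @ w)  # gradient of L_i at w
        w = w - eta * grad_i                      # stochastic descent step
    return w

# Toy usage: noiseless data generated from w_true, which SGD should recover.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true
print(sgd_linear_regression(X, y))  # expect approximately [2, -3]
```

One design note: with a fixed $\eta$ and noisy data, the iterates hover in a noise ball around the optimum; the convergence property in the list above relies on an appropriately decaying step size $\eta_t$ (e.g., $\eta_t \propto 1/t$).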
Perceptron

The perceptron is an algorithm for linear classification. If the data is separable, the perceptron will find a linear separator.

Figure 6: GD gets stuck at the flat point, but SGD does not.

Consider the linear model
$$f(x) = w^T x + w_0, \qquad h(x) = \mathrm{sign}(f(x)).$$
The loss function is
$$L_i(w) = \max(0, -y_i f(x_i)).$$
Then we have the gradient
$$\nabla_w L_i(w) = \begin{cases} 0, & y_i f(x_i) > 0 \\ -y_i x_i, & y_i f(x_i) < 0 \\ 0, & y_i f(x_i) = 0 \end{cases}$$
The perceptron algorithm is then defined as follows. For $i \in [n]$, the update rule in each iteration is
$$w_{t+1} = \begin{cases} w_t + \eta_t y_i x_i, & \text{if } y_i f(x_i) < 0 \text{ (wrong prediction)} \\ w_t, & \text{if } y_i f(x_i) > 0 \text{ (correct prediction)} \end{cases}$$
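Below is a minimal sketch of the perceptron update above, run in epochs over the data until a full pass produces no mistakes (which happens when a separator is found). One assumption to flag: the code updates on $y_i f(x_i) \le 0$ rather than strictly $< 0$, a common practical convention so that the all-zeros initialization counts as a mistake, whereas the notes assign zero gradient to the boundary case.

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    """Perceptron: SGD on L_i(w) = max(0, -y_i * f(x_i)), f(x) = w^T x + w_0.
    Updates only on mistakes: w <- w + eta * y_i * x_i (and likewise w_0)."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w + w0) <= 0:  # wrong (or on boundary): update
                w = w + eta * y[i] * X[i]
                w0 = w0 + eta * y[i]
                mistakes += 1
        if mistakes == 0:                    # clean pass: separator found
            break
    return w, w0

# Toy usage: labels given by the sign of x_1 + x_2 + 0.1, so the data is
# linearly separable and the perceptron should converge.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([1.0, 1.0]) + 0.1)
w, w0 = perceptron(X, y)
print(np.all(np.sign(X @ w + w0) == y))  # expect: True
```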