CS446: Machine Learning, Fall 2017
Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron
Lecturer: Sanmi Koyejo                Scribe: Ke Wang, Oct. 24th, 2017

Agenda

• Recap: SVM and hinge loss, Representer theorem
• Stochastic Gradient Descent
• Perceptron, other examples

Kernels

Recall the feature transformation. For x ∈ X and y ∈ Y, the prediction function is

    f : X → Y

Let φ be the feature transformation function, giving the mapping

    x → φ(x)

After the transformation we have

    f(x) = w^T φ(x) + w_0

Note that f(x) is a linear function of φ(x) but a non-linear function of x (provided φ is non-linear).

Define the kernel function as

    K(x_i, x_j) = φ(x_i)^T φ(x_j)

This is especially useful when φ(x) is infinite dimensional. The motivation for an infinite-dimensional φ(x) is that, in theory, you can almost always make the data points linearly separable by lifting the data into an infinite-dimensional space. See Figure 1.

Figure 1: Data lifted to a higher dimension becomes linearly separable, from Kim (2013).

When φ(x) is infinite dimensional, solving for w in w^T φ(x) directly is hard, because w is also infinite dimensional. But since f(x) typically only depends on inner products of the form φ(x_i)^T φ(x_j), we can represent f(x) with the kernel function, without ever computing φ(x) or w. For a given data point,

    f(x) = w^T φ(x) = Σ_{i=1}^n α_i K(x, x_i)

We have shown two applications of kernels:

• Linear regression, using a linear algebra identity
• SVM, using the method of Lagrange multipliers

Hilbert Spaces

H_k is the space of well-behaved linear functions of the form w^T φ(x). This is equivalent to the space of bounded functions

    f(x) = Σ_{i=1}^∞ α_i K(x, x_i),    where K(x_i, x_j) = φ(x_i)^T φ(x_j)

Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to 1/||w||:

    min_w  ||w||_2^2 / 2
    s.t.   y_i f(x_i) ≥ 1,  ∀ i ∈ [n]

where f(x) = w^T x.

Why set the threshold to 1? Suppose we change it to some constant C:

    y_i f(x_i) ≥ C
    y_i f(x_i) / C ≥ 1
    y_i (w / C)^T x_i ≥ 1

Rescaling w by 1/C absorbs the constant, so the solution to the optimization problem is unchanged and we can set C = 1 without loss of generality.
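To make the kernel prediction form f(x) = Σ_{i=1}^n α_i K(x, x_i) from the Kernels section concrete, here is a minimal sketch in Python. The Gaussian (RBF) kernel, the data points, and the coefficients α are illustrative assumptions, not taken from the lecture; the only point is that prediction needs kernel evaluations K(x, x_i) and the coefficients α_i, never φ(x) or w explicitly.

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2); the corresponding phi is infinite
    # dimensional, but the inner product phi(a)^T phi(b) is cheap to evaluate.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_predict(x, X_train, alpha, gamma=1.0):
    # f(x) = sum_i alpha_i * K(x, x_i): only kernel evaluations against the
    # training points are needed, never w or phi(x).
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

# Tiny made-up example: 3 training points in 2 dimensions.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.5, -1.0, 0.3])   # would come from training a kernel method
print(kernel_predict(np.array([0.5, 0.5]), X_train, alpha))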
Non-separable Data

Recall that with slack variables ξ_i we have

    min_{w, ξ}  ||w||_2^2 + Σ_{i=1}^n ξ_i
    s.t.  ξ_i ≥ 0,  y_i f(x_i) ≥ 1 − ξ_i

Figure 2: Visualization of the SVM with slack variables (the circled data points lie on the margin).

Intuitively, ξ_i represents how much tolerance to misclassification the model has. See Figure 2.

With f(x) = w^T φ(x), the optimal weights can be written as

    w* = Σ_{i=1}^n α_i y_i φ(x_i),   where α_i ≥ 0
    w* = Σ_{i=1}^n α̃_i φ(x_i),      where α̃_i = α_i y_i

When the data is separable (no slack), we have α_i ≠ 0 iff y_i f(x_i) = 1. In the non-separable case, we have α_i > 0 iff y_i f(x_i) = 1 − ξ_i, and α_i = 0 otherwise.

The prediction function can therefore be written in the form

    f(x) = Σ_{i=1}^n α_i K(x, x_i)

Note that storage for the kernel representation is O(nd), whereas storage for the SVM is O(dS), where S is the number of support vectors. This is because we can ignore the entries K(x, x_i) for which x_i is not a support vector.

Representer Theorem

The optimization problem in the Hilbert space is

    min_{f ∈ H_k}  Σ_{i=1}^n l(y_i, f(x_i)) + λ ||f||²_{H_k}

where l(y_i, f(x_i)) is the loss function. The Representer Theorem says that the solution can be written as

    f*(x) = Σ_{i=1}^n α_i K(x, x_i)

We have shown that:

• In kernel ridge regression, using a matrix identity, l(y_i, f(x_i)) = (y_i − f(x_i))^2
• In kernel SVM, using the dual representation (Lagrange multipliers), l(y_i, f(x_i)) = max(0, 1 − y_i f(x_i))

This also applies to other kernel algorithms, for example:

• Kernel logistic regression
• Kernel Poisson regression

Stochastic Gradient Descent

All the loss functions we have seen so far are averages over the data points:

    L(w) = (1/n) Σ_{i=1}^n L_i(w)

In regression we have

    L_i(w) = (y_i − w^T x_i − w_0)^2

and in the SVM we have

    L_i(w) = max(0, 1 − y_i (w^T x_i + w_0))

The gradient is

    ∇_w L(w) = (1/n) Σ_{i=1}^n ∇_w L_i(w)

The population risk is

    L̄(w) = E_P[L(w)]

and under weak conditions

    ∇ L̄(w) = E_P[∇_w L(w)]

We can replace E_P[∇_w L(w)] with any unbiased estimate, i.e. pick a β such that

    bias = E[β] − E_P[∇_w L(w)] = 0

For example, the sample mean is unbiased: if X ∼ P with μ = E[X], then

    E[(1/n) Σ_{i=1}^n x_i] = E[X]

But a single example x_i drawn from the data is also unbiased, since E[x_i] = E[X]. So we can use ∇_w L_i(w) as an unbiased estimator of E_P[∇_w L(w)].

The Stochastic Gradient Descent algorithm is then, in the t-th iteration:

• Pick i uniformly from [n]
• Update the parameters: w_{t+1} = w_t − η_t ∇_w L_i(w_t)
• Go to the next iteration until a stopping condition is hit, for example a limit on the number of iterations

That is, in each iteration we randomly pick a data point (y_i, x_i) and use that point to compute L_i(w) and ∇ L_i(w), then use this "randomly selected" gradient to update w. For example, in the linear regression case:

    L_i(w) = (y_i − w^T x_i)^2
    ∇ L_i(w) = −2 x_i (y_i − w^T x_i)
    w_{t+1} = w_t + 2 η_t x_i (y_i − w_t^T x_i)

A short code sketch of this update appears at the end of this section. Some useful properties of Stochastic Gradient Descent:

• It converges eventually if L(w) is convex and the step size is chosen appropriately. See Figures 3 and 4.
• It easily handles streaming data for online learning.
• It is adaptive. See Figure 5.
• It handles non-differentiable functions easily. See Figure 6.

Figure 3: Behavior of gradient descent, from Wikipedia, the free encyclopedia (2007).

The intuition is

    SGD ≈ GD + noise
    ∇_w L_i(w) ≈ ∇_w L(w) + noise

Figure 4: Behavior of stochastic gradient descent.

Figure 5: The SGD model shifts as new data arrives (the new data points are drawn larger).

Figure 6: GD gets stuck at the flat point, but SGD does not.
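As a concrete version of the linear regression SGD update above, here is a minimal sketch in Python. The synthetic data, the constant step size eta, and the fixed iteration count are assumptions made for the sketch; the lecture only requires picking i uniformly and choosing the step size appropriately.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): y is roughly x^T w_true plus noise.
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.01                                    # constant step size, chosen for the sketch
for t in range(5000):
    i = rng.integers(n)                       # pick i uniformly from [n]
    grad_i = -2 * X[i] * (y[i] - X[i] @ w)    # grad L_i(w) = -2 x_i (y_i - w^T x_i)
    w = w - eta * grad_i                      # w_{t+1} = w_t - eta_t grad L_i(w_t)

print(w)   # should be close to w_true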
Perceptron

The perceptron is an algorithm for linear classification. If the data is separable, the perceptron will find a linear separator.

Consider the linear model

    f(x) = w^T x + w_0
    h(x) = sign(f(x))

The loss function is

    L_i(w) = max(0, −y_i f(x_i))

Then we have the gradient

    ∇_w L_i(w) = 0           if y_i f(x_i) > 0
                 −y_i x_i    if y_i f(x_i) < 0
                 0           if y_i f(x_i) = 0

The perceptron algorithm is defined as follows. For i ∈ [n], the update rule in each iteration is

    w_{t+1} = w_t + η_t y_i x_i    if y_i f(x_i) < 0 (wrong prediction)
    w_{t+1} = w_t                  if y_i f(x_i) > 0 (correct prediction)
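A minimal sketch of the perceptron update rule above, assuming a constant step size η_t = 1, a fixed maximum number of passes over the data, and the bias w_0 folded into w via a constant feature; these are implementation choices for the sketch rather than details from the notes. The code also treats y_i f(x_i) = 0 as a mistake (using ≤ 0 rather than the strict < 0 above) so that the all-zero initialization still gets updated.

import numpy as np

def perceptron(X, y, eta=1.0, n_epochs=100):
    # X: (n, d) features, y: (n,) labels in {-1, +1}.
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])    # fold the bias w_0 into w
    w = np.zeros(d + 1)
    for _ in range(n_epochs):
        updated = False
        for i in range(n):
            if y[i] * (w @ Xb[i]) <= 0:     # wrong (or zero-margin) prediction
                w = w + eta * y[i] * Xb[i]  # w_{t+1} = w_t + eta * y_i * x_i
                updated = True
        if not updated:                     # every point classified correctly
            break
    return w

# Tiny separable example (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w[:-1] + w[-1]))   # should match y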

