CS 446: Machine Learning, Fall 2017
Lecture 15: Stochastic Gradient Descent and Perceptron
Lecturer: Sanmi Koyejo    Scribe: Qiwen Wang, Oct. 17th, 2017

Recap

Feature mapping

Let f be a mapping and \phi be a feature function, where
    f : x \to y,
    x \to \phi(x).
For example, consider a feature function \phi that separates points inside and outside of a circle. The mapping function then becomes
    f(x) = w^T \phi(x).
Here, f is a linear function with respect to \phi(x), but a nonlinear function with respect to x (if \phi is nonlinear). In the figure on the left of Figure 1, the separating boundary cannot be represented by a linear function of x, while after the feature mapping we can represent the separating hyperplane with a linear function.

Figure 1: After transforming the data with the feature function \phi, we are able to separate the points with a hyperplane, i.e., a linear function.

Mercer Kernel

The kernel between x_i and x_j is the inner product between the corresponding feature vectors, meaning
    \kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j).
The kernel function is particularly useful when \phi(x) is infinite dimensional. Since
    f(x) = w^T \phi(x) = \sum_{i=1}^n \kappa(x, x_i) \alpha_i,
when \phi(x) is infinite dimensional, or has a very high dimension, the kernel approach is more efficient to compute.

Hilbert Space

The Hilbert space H_k is a space of bounded linear functions w^T \phi(x). Equivalently, H_k is the space of bounded linear functions f(x) = \sum_{i=1}^n \kappa(x, x_i) \alpha_i.

Support Vector Machine

For the fully separable case, the goal is to find
    \min_w \frac{||w||_2^2}{2}   such that   y_i f(x_i) \ge 1   for all i \in [n].
This is the same as finding the w that maximizes the margin 1/||w||.

Why do we set the threshold to 1?

Note that f(x) = w^T x. Suppose instead we require
    y_i f(x_i) \ge 10,   i.e.,   y_i (w/10)^T x_i \ge 1.
Letting w' = w/10, the new goal is to minimize ||w'||_2^2 / 2 such that y_i w'^T x_i \ge 1, which has the same form as before. Therefore it does not matter what constant we choose for the threshold.

For the non-separable case, the goal is to find
    \min_{w, \xi} \frac{||w||_2^2}{2} + \sum_{i=1}^n \xi_i   such that   \xi_i \ge 0,   y_i f(x_i) \ge 1 - \xi_i,   for all i \in [n].
We call \xi_i the slack variable; it captures the error in the prediction.

Figure 2: The circled data points are support vectors. One point with a negative label fails to be classified as negative; it receives a slack penalty \xi_n.

The solution w^* can be written in the form
    w^* = \sum_{i=1}^n \alpha_i y_i \phi(x_i),
where \alpha_i \ge 0 for all i \in [n]. \alpha_i is non-zero exactly when the corresponding point is a support vector. When the data is separable and there is no slack, the support vectors all lie on the margin boundary, meaning
    \alpha_i \ne 0 if and only if y_i f(x_i) = 1.
When the data is non-separable, the support vectors are the x_i such that y_i f(x_i) = 1 - \xi_i. Support vectors for non-separable data are not necessarily on the boundary.

Why is the SVM important?

Storing a general kernel predictor requires keeping all n training points, i.e., O(nd) space, whereas the SVM only needs O(d x #SV), where #SV is the number of support vectors. Thus the SVM greatly reduces storage.

Representer Theorem

We can write the optimization problem in terms of the Hilbert space H_k:
    \min_{f \in H_k} \sum_{i=1}^n l(y_i, f(x_i)) + \lambda ||f||_{H_k}^2.
We can substitute different loss functions for l. If we choose the loss for kernel ridge regression, then
    l(y_i, f(x_i)) = (y_i - f(x_i))^2.
If we choose the loss for the kernel SVM, then
    l(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)).
The representer theorem says that we do not need to optimize over all of H_k (or write down \phi explicitly); instead, the optimal function always has the form
    f^*(x) = \sum_{i=1}^n \alpha_i \kappa(x, x_i),
so we can optimize the \alpha_i directly.
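To make the representer theorem concrete, the following is a minimal sketch (not from the lecture) of kernel ridge regression that optimizes the coefficients \alpha_i directly. It assumes a Gaussian (RBF) kernel and uses the standard closed form \alpha = (K + \lambda I)^{-1} y, which follows from substituting f(x) = \sum_i \alpha_i \kappa(x, x_i) into the objective above; the kernel choice, gamma, lam, and the synthetic data below are illustrative assumptions.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Example Mercer kernel: kappa(a, b) = exp(-gamma * ||a - b||^2)
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    # Representer theorem: f*(x) = sum_i alpha_i * kappa(x, x_i),
    # so we only solve for alpha. For the square loss the minimizer
    # has the closed form alpha = (K + lam * I)^{-1} y.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
    # f(x) = sum_i alpha_i * kappa(x, x_i)
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Tiny usage example on synthetic 1-D data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y)
y_hat = predict_kernel_ridge(X, alpha, X)

Note that prediction only needs the training points and the learned \alpha, never the (possibly infinite-dimensional) feature map \phi.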
Such techniques can be applied to kernel logistic regression, kernel Poisson regression, etc.

Stochastic Gradient Descent

All the loss functions so far can be written as
    l(w) = \frac{1}{n} \sum_{i=1}^n l_i(w).
We solve the above problem using gradient descent,
    \nabla_w l(w) = \frac{1}{n} \sum_{i=1}^n \nabla_w l_i(w),
which averages the gradients over all samples.

If instead of the training set we look at the data distribution, we obtain the population risk,
    l(w) = E_p[l_i(w)].
Under weak conditions, the gradient and the expectation can be exchanged,
    \nabla l(w) = E_p[\nabla_w l_i(w)],
so the gradient of the empirical risk approximates the gradient of the population risk.

We then want to replace \nabla_w l(w) with an unbiased approximation, i.e., pick \beta such that
    E[\beta] - E_p[\nabla_w l(w)] = 0.
The unbiased estimator used by SGD is \nabla_w l_i(w) for a single sample i. Picking i uniformly from [n], the SGD update rule is
    w_{t+1} = w_t - \eta_t \nabla_w l_i(w_t).
For example, pick one random sample (y_i, x_i) and let the loss be the square loss,
    l_i(w) = (y_i - w^T x_i)^2.
Taking the gradient with respect to w, we have
    \nabla_w l_i(w) = -2 x_i (y_i - w^T x_i).
Thus the update rule for the square loss is
    w_{t+1} = w_t + 2 \eta_t x_i (y_i - w_t^T x_i)
(a short code sketch appears at the end of these notes). Intuitively, we can see SGD as GD plus noise.

Useful properties:
• It converges if l(w) is convex and the step size is chosen appropriately.
• Online learning: it easily handles streaming data.
• It is adaptive.
• It can handle non-differentiable losses easily (using subgradients).

Figure 3: Gradient descent. The direction always points towards the optimal point.

Figure 4: Stochastic gradient descent. The direction mostly points towards the optimal point, but includes some "noise".

Perceptron

The perceptron is a linear classifier that uses a linear function f(x) = w^T x + w_0. The prediction h(x) is based on the sign of f(x), meaning
    h(x) = sign(f(x)).
The inequality y_i f(x_i) < 0 means a wrong prediction, and we update the weights only when the prediction is wrong. Therefore
    w_{t+1} = w_t + \eta_t y_i x_i   if y_i f(x_i) < 0 (wrong prediction),
    w_{t+1} = w_t                    otherwise.
The loss function for each sample is
    l_i(w) = \max(0, -y_i f(x_i)).
Applying (stochastic) gradient descent to this loss, we have
    \frac{d}{dw} \max(0, -y_i f(x_i)) =
        0,          if y_i f(x_i) > 0,
        -y_i x_i,   if y_i f(x_i) < 0,
        either one (it does not matter),   if y_i f(x_i) = 0.
If the data is separable, then the perceptron will find a linear separator.
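As a concrete illustration, here is a minimal sketch of the SGD update for the square loss derived above; the constant step size eta, the iteration count, and the synthetic data are illustrative assumptions, not part of the lecture.

import numpy as np

def sgd_square_loss(X, y, eta=0.01, n_iters=5000, seed=0):
    # SGD on l(w) = (1/n) * sum_i (y_i - w^T x_i)^2.
    # Per-sample gradient: grad l_i(w) = -2 * x_i * (y_i - w^T x_i),
    # so w <- w - eta * grad = w + 2 * eta * x_i * (y_i - w^T x_i).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)                 # pick i uniformly from [n]
        residual = y[i] - X[i] @ w
        w = w + 2 * eta * residual * X[i]   # noisy step toward the optimum
    return w

# Tiny usage example: recover a known linear model from noisy data
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(200)
w_hat = sgd_square_loss(X, y)

Each step uses a single sample, which is why the trajectory wanders ("GD plus noise") while still converging for an appropriately decaying or small step size.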

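Similarly, here is a minimal sketch of the perceptron update rule described above. Folding the bias w_0 into w by appending a constant feature, the step size eta, and treating the y_i f(x_i) = 0 case as a mistake (the notes say this case does not matter) are illustrative choices.

import numpy as np

def perceptron(X, y, eta=1.0, n_epochs=100):
    # Labels y are assumed to be in {-1, +1}; f(x) = w^T x + w_0 with
    # the bias handled by appending a constant 1 feature to each x.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:      # wrong prediction (or on the boundary)
                w = w + eta * yi * xi   # perceptron update
                mistakes += 1
        if mistakes == 0:               # no errors in a full pass: done
            break
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

On separable data the outer loop stops as soon as a full pass makes no mistakes, matching the convergence claim above.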
