# ILLINOIS CS 446 - 101717.2 (7 pages)


- Pages: 7
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017. Lecture 15: Stochastic Gradient Descent and Perceptron. Lecturer: Sanmi Koyejo. Scribe: Qiwen Wang. Oct 17th, 2017.

## Recap: Feature Mapping

Let $f$ be a function and $\phi$ be a feature mapping, where $f(x) = w^T \phi(x)$. For example, consider a feature function $\phi$ that separates points inside and outside of a circle. The prediction function is then $f(x) = w^T \phi(x)$: $f$ is linear with respect to $\phi(x)$, but nonlinear with respect to $x$ when $\phi$ is nonlinear. The data on the left of Figure 1 cannot be separated by a hyperplane that is linear in $x$, while after the feature mapping the separating hyperplane is linear in $\phi(x)$.

*Figure 1: After transforming the data with the feature function, we are able to separate the points with a hyperplane, i.e., a linear function.*

## Mercer Kernel

The kernel between $x_i$ and $x_j$ is the inner product of the corresponding feature vectors:

$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

The kernel function is particularly useful when $\phi(x)$ is infinite-dimensional. Since

$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} \kappa(x, x_i)\,\alpha_i,$$

when $\phi(x)$ is infinite-dimensional (or very high-dimensional), the kernel approach is more efficient to compute.

## Hilbert Space

The Hilbert space $\mathcal{H}_\kappa$ is a space of bounded linear functions $w^T \phi(x)$. Equivalently, it is the space of bounded linear functions of the form $f(x) = \sum_{i=1}^{n} \kappa(x, x_i)\,\alpha_i$.

## Support Vector Machine

For the fully separable case, the goal is to find

$$\min_w \frac{\|w\|_2^2}{2} \quad \text{such that } y_i f(x_i) \ge 1 \text{ for all } i \in [n].$$

This is the same as finding the $w$ that maximizes the margin $1/\|w\|$.

Why do we set the threshold to be 1? Note that $f(x) = w^T \phi(x)$. Suppose instead we require $y_i f(x_i) \ge 10$, i.e., $y_i w^T \phi(x_i) \ge 10$. Dividing by 10 and letting $w' = w/10$, the constraint becomes $y_i (w')^T \phi(x_i) \ge 1$, and the new goal is to minimize $\|w'\|_2^2 / 2$ subject to $y_i f(x_i) \ge 1$. Therefore it does not matter what constant we use as the threshold.

For the non-separable case, the goal is to find

$$\min_w \frac{\|w\|_2^2}{2} + \sum_{i=1}^{n} \xi_i \quad \text{such that } \xi_i \ge 0, \; y_i f(x_i) \ge 1 - \xi_i \text{ for all } i \in [n].$$

We call $\xi_i$ the slack variable; it captures the error in the prediction.

*Figure 2: The circled data points are the support vectors.*
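The circle example from the recap can be made concrete with a short sketch. The specific feature map $\phi(x) = (x_1^2, x_2^2)$ and the weights below are my choices for illustration; the notes only say that $\phi$ separates points inside and outside of a circle.

```python
import numpy as np

# Hypothetical feature map for the circle example: phi(x) = (x1^2, x2^2).
# The circle ||x||^2 = r^2 becomes the line z1 + z2 = r^2 in feature space,
# so f(x) = w^T phi(x) + b is linear in phi(x) but nonlinear in x.
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)  # inside unit circle -> +1

# In feature space, w = (-1, -1) with bias b = 1 separates the classes:
# f(x) = 1 - x1^2 - x2^2 > 0 exactly when ||x|| < 1.
w, b = np.array([-1.0, -1.0]), 1.0
preds = np.sign(np.array([w @ phi(x) + b for x in X]))
print((preds == y).all())
```

In the original space no line separates the classes; in $\phi$-space a single linear function does, which is the point of the recap.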
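The identity $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ can also be checked numerically. The quadratic kernel and its explicit feature map below are a standard illustration, not taken from the notes.

```python
import numpy as np

# For the quadratic kernel k(x, z) = (x^T z)^2 in 2-D, one explicit
# feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), since
# (x1*z1 + x2*z2)^2 = x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k(x, z), phi(x) @ phi(z)))  # kernel = feature inner product
```

Evaluating $k(x, z)$ costs $O(d)$ here, while forming $\phi$ explicitly grows quadratically in $d$, which is why the kernel route wins for high-dimensional (or infinite-dimensional) feature maps.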
*One point with a negative label fails to be classified as negative; it receives a slack penalty.*

The solution $w$ can be written in the form

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i,$$

where $\alpha_i \ge 0$ for all $i \in [n]$, and $\alpha_i$ is non-zero exactly when the corresponding point is a support vector. When the data is separable and there is no slack, the support vectors all lie on the margin boundary, i.e., $\alpha_i \ne 0$ if and only if $y_i f(x_i) = 1$. When the data is non-separable, the support vectors are the $x_i$ such that $y_i f(x_i) = 1 - \xi_i$; support vectors for non-separable data are not necessarily on the boundary.

Why is the support-vector representation of the SVM important? Storing the kernel function requires $O(nd)$ space, while storing the SVM solution only requires $O(d \cdot |SV|)$, where $|SV|$ is the number of support vectors. Thus the SVM greatly reduces storage.

## Representation Theorem

We can write the optimization problem in terms of the Hilbert space $\mathcal{H}_\kappa$:

$$\min_{f \in \mathcal{H}_\kappa} \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \|f\|_{\mathcal{H}_\kappa}^2.$$

We can substitute different loss functions for $\ell$. For kernel ridge regression, $\ell(y_i, f(x_i)) = (y_i - f(x_i))^2$; for the kernel SVM, $\ell(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$.

The representation theorem says that we do not need to optimize over the whole function space: the optimal function has the form

$$f(x) = \sum_{i=1}^{n} \alpha_i \kappa(x, x_i),$$

so we can optimize the $\alpha_i$ directly. The same technique applies to kernel logistic regression, kernel Poisson regression, etc.

## Stochastic Gradient Descent

All the loss functions so far can be written as

$$\ell(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(w).$$

We can solve this problem using gradient descent,

$$\nabla_w \ell(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w \ell_i(w),$$

which averages the gradients over all samples. If, instead of the data in the training set, we consider the population risk $\ell(w) = \mathbb{E}_P[\ell(w)]$, then under weak conditions its gradient is $\nabla \ell(w) = \mathbb{E}_P[\nabla_w \ell(w)]$, so the empirical gradient approximates the population gradient. We then want to replace $\nabla_w \ell(w)$ with an unbiased approximation: we pick an estimator $g$ such that $\mathbb{E}[g] - \mathbb{E}_P[\nabla_w \ell(w)] = 0.$
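The representation theorem makes kernel ridge regression very concrete: substituting $f(x) = \sum_i \alpha_i \kappa(x, x_i)$ into the squared-loss objective turns the problem into a linear system in $\alpha$. A minimal sketch follows; the RBF kernel, the toy data, and the explicit regularization weight `lam` are my choices (the notes write the penalty without a weight).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0])

lam = 1e-3                          # regularization weight (assumed)
K = rbf_kernel(X, X)
# Representer theorem: f(x) = sum_i alpha_i * k(x, x_i).  For the squared
# loss plus the norm penalty, the coefficients solve (K + lam*I) alpha = y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

f_train = K @ alpha                 # fitted values at the training points
print(np.mean((f_train - y) ** 2))  # near-interpolating fit for small lam
```

Note that we never build $\phi(x)$: only the $n \times n$ kernel matrix and the $n$ coefficients $\alpha_i$ appear, exactly as the theorem promises.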
The unbiased estimator used by SGD is $\nabla_w \ell_i(w)$: pick $i$ uniformly from $\{1, \dots, n\}$; the SGD update rule is

$$w_{t+1} = w_t - \eta_t \nabla_w \ell_i(w_t).$$

For example, pick one random sample $(y_i, x_i)$ and let the loss be the square loss $\ell_i(w) = (y_i - w^T x_i)^2$. Taking the derivative gives

$$\nabla_w \ell_i(w) = -2 x_i (y_i - w^T x_i),$$

so the update rule for the square loss is

$$w_{t+1} = w_t + \eta_t \, 2 x_i (y_i - w_t^T x_i).$$

Intuitively, we can see SGD as GD plus noise. Useful properties:

- Convergence: SGD converges if $\ell(w)$ is convex and the step size is chosen appropriately.
- Online learning: it easily handles streaming data.
- Adaptive: it can deal with non-differentiable functions easily.

*Figure 3: Gradient descent. The direction always points towards the optimal point.*

*Figure 4: Stochastic gradient descent. The direction mostly points towards the optimal point, but includes some noise.*

## Perceptron

The perceptron is a linear classifier that uses the linear function $f(x) = w^T x + w_0$. The prediction $h(x)$ is based on the sign of $f(x)$: $h(x) = \operatorname{sign}(f(x))$. The inequality $y_i f(x_i) \le 0$ means a wrong prediction, and we update only when the prediction is wrong:

$$w_{t+1} = \begin{cases} w_t + \eta_t y_i x_i & \text{if } y_i f(x_i) \le 0 \text{ (wrong prediction)} \\ w_t & \text{otherwise.} \end{cases}$$

The loss function for each sample is $\ell_i(w) = \max(0, -y_i f(x_i))$. Applying gradient descent to this loss gives

$$\frac{d}{dw} \max(0, -y_i f(x_i)) = \begin{cases} 0 & y_i f(x_i) > 0 \\ -y_i x_i & y_i f(x_i) < 0 \\ \text{doesn't matter} & y_i f(x_i) = 0. \end{cases}$$

If the data is separable, the perceptron will find a linear separator.
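The square-loss SGD update can be sketched end to end. The synthetic noiseless data, the constant step size, and the iteration count are my choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless linear data (assumed)

w = np.zeros(d)
eta = 0.01                          # constant step size (assumed)
for t in range(5000):
    i = rng.integers(n)             # pick i uniformly from {1, ..., n}
    grad = -2 * X[i] * (y[i] - w @ X[i])  # gradient of (y_i - w^T x_i)^2
    w = w - eta * grad              # w_{t+1} = w_t - eta * grad
print(np.allclose(w, w_true, atol=1e-2))
```

Each step touches a single sample, which is what makes SGD suitable for the streaming and online settings listed above; on noisy data the iterates would hover around the optimum ("GD plus noise") unless the step size is decayed.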
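The perceptron update rule can likewise be sketched on linearly separable data. The synthetic data, the margin filter that guarantees fast convergence, and the step size $\eta = 1$ are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_star = np.array([2.0, -1.0])
scores = X @ w_star + 0.5
keep = np.abs(scores) > 1.0          # enforce a margin so convergence is quick
X, y = X[keep], np.sign(scores[keep])

# Fold the bias w0 into w by appending a constant-1 feature.
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(3)

for epoch in range(200):
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:       # y_i f(x_i) <= 0: wrong prediction
            w = w + yi * xi          # w_{t+1} = w_t + eta * y_i * x_i, eta = 1
            mistakes += 1
    if mistakes == 0:                # one full clean pass: converged
        break

print(all(np.sign(w @ xi) == yi for xi, yi in zip(Xb, y)))
```

Because the data is separable with a positive margin, the number of updates is bounded, matching the closing claim that the perceptron finds a linear separator on separable data.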
