ILLINOIS CS 446 - 101917.1
CS446: Machine Learning, Fall 2017
Lecture 15: Backpropagation and Multi-layer Perceptron
Lecturer: Sanmi Koyejo        Scribe: Zhenfeng Chen, Oct. 19th, 2017

Agenda

• Recap: SGD, Perceptron
• Backpropagation
• Multi-layer Perceptron

Recap

• Stochastic Gradient Descent (SGD)

The risk function can usually be approximated by the empirical risk, which equals the average of the per-sample losses:

    l(w) = (1/n) Σ_{i=1}^{n} l_i(w),        R(f(w), p) = E_p[l(w)]

where w is the model parameter and f(w) is a classifier. Gradient descent updates the parameter by moving against the gradient, scaled by a step size η_t:

    w_{t+1} = w_t − η_t ∇_w R(·, p)

Therefore we can get a good estimator of ∇_w R(·, p) for gradient descent by computing the empirical average of the gradient.

• Some properties

(a) Under weak conditions, we have ∇_w E_p[l(w)] = E_p[∇_w l(w)]. That is, for most loss functions the gradient of the expected loss equals the expectation of the gradient of the loss.

(b) SGD needs an unbiased estimator of ∇_w l_i(w). The gradient of one sample is an unbiased estimator of the full gradient, so we only need an unbiased estimate of the gradient instead of the "true" gradient.
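Property (b) can be checked numerically. The sketch below, using hypothetical 1-D data and a squared loss (both chosen only for illustration), confirms that the expectation of the single-sample gradient over a uniformly random index equals the full empirical-risk gradient, which is the unbiasedness SGD relies on:

```python
# Toy 1-D data (hypothetical values, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]
w = 0.5

def grad_i(w, i):
    """Gradient of the squared loss l_i(w) = (w*x_i - y_i)^2 for one sample."""
    return 2.0 * (w * xs[i] - ys[i]) * xs[i]

# Full gradient of the empirical risk l(w) = (1/n) sum_i l_i(w).
n = len(xs)
full_grad = sum(grad_i(w, i) for i in range(n)) / n

# Expectation of the single-sample estimator over a uniform random index:
# each index has probability 1/n, so this equals the full gradient.
expected_single = sum(grad_i(w, i) * (1.0 / n) for i in range(n))
print(abs(full_grad - expected_single) < 1e-12)  # True
```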
That is why, in SGD, the true gradient of the loss function can be approximated by the gradient of one single example.

Question: How to prove an estimator is unbiased?
Proof:
0) Suppose x ∼ p.
1) Sample x_1, x_2, ..., x_n from the population to form an estimator T(x_1, x_2, ..., x_n).
2) Since each x_i is drawn from p, we have E[x_i] = E[x].
3) For the sample mean T(x_1, ..., x_n) = (1/n) Σ_{i=1}^{n} x_i, we have

    E[T] = (1/n) Σ_{i=1}^{n} E[x_i] = (1/n) Σ_{i=1}^{n} E[x] = E[x].

Therefore E[T] − E[x] = 0, so the estimator is unbiased. (The single-sample estimator T = x_i, for a uniformly random index i, is unbiased by the same argument.)

(c) SGD has high variance.
Question: How to reduce the variance?
Answer: Sample data randomly from the dataset and average the gradient over a mini-batch.

• Mini-Batch SGD / multiple-sample SGD

Algorithm 1 Mini-batch SGD
    D_n = {x_1, ..., x_n}; randomly shuffle the data
    for j in range(epochs) do
        g(w) ← (1/k) Σ_{i=kj}^{k(j+1)} ∇_w l_i(w)
        w_{j+1} ← w_j − η_j g(w)
    end for

Question: Why do we need to randomly shuffle the data?
Answer: It avoids correlation in the sample order. Real data are usually not i.i.d. (independent and identically distributed), and correlated examples are often stored together.

• Bias of Mini-Batch

Suppose T(x_1, ..., x_n) = (1/n) Σ_{i=1}^{n} x_i. Then

    Bias(T) = E[T] − E[x]
            = E[(1/n) Σ_{i=1}^{n} x_i] − E[x]
            = (1/n) Σ_{i=1}^{n} E[x_i] − E[x]
            = (1/n) Σ_{i=1}^{n} E[x] − E[x]
            = 0

• What is the tradeoff when we use SGD or mini-batch SGD instead of gradient descent?

Comparing gradient descent, SGD, and mini-batch SGD:

Gradient Descent:
The per-iteration cost of computing the gradient is Nγ, where γ is the cost of computing the gradient for a single sample.
Pro:
- Gradient descent, also known as steepest descent, generally converges in fewer iterations than SGD and mini-batch SGD; you can think of GD as following the most direct path to the global optimum, provided it does not get stuck at a local optimum.
Con:
- It requires more memory to store gradients.
- It has a higher per-iteration computational cost.
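Algorithm 1 can be sketched in plain Python. The problem below (a hypothetical 1-D least-squares fit with squared loss) and the specific step-size schedule are illustrative choices, not part of the notes:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical 1-D least-squares problem: fit y = w*x with true slope 2.0.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

def grad(w, batch):
    # Average gradient of l_i(w) = (w*x_i - y_i)^2 over the mini-batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def minibatch_sgd(data, w=0.0, k=2, epochs=50):
    data = list(data)
    for j in range(epochs):
        random.shuffle(data)              # avoid correlation in sample order
        eta = 0.05 / (1.0 + j)            # decreasing step size (one valid choice)
        for start in range(0, len(data), k):
            batch = data[start:start + k]
            w = w - eta * grad(w, batch)  # move against the averaged gradient
    return w

w_hat = minibatch_sgd(data)
print(w_hat)  # close to the true slope 2.0
```

Shuffling once per epoch, as in Algorithm 1, ensures each mini-batch is a random sample of the data rather than a correlated run of adjacent examples.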
The computational cost of GD is expensive when the dataset is huge.

SGD:
The per-iteration computational cost of SGD is γ.
Pro:
- SGD is less likely to get stuck at flat regions, because it has a certain amount of "noise" (Murphy, 2012) that helps it escape local optima.
Con:
- SGD has a slower convergence rate in general.

Mini-batch SGD:
The per-iteration computational cost of mini-batch SGD is kγ, where k is the batch size.
Pro:
- Mini-batch SGD is also less likely to get stuck at flat regions.

• How to estimate the final parameter?

1. Pick the last iterate w_T, which may not be exactly the global optimum.
2. Average the entire parameter sequence: (1/T) Σ_{t=1}^{T} w_t. If T is large enough, the contribution of the first few (pre-convergence) steps is washed out.
3. Use a local average over the last s iterates, e.g. (1/s) Σ_{t=T−s}^{T} w_t.

• How to set the step size for SGD?

A proper step size should satisfy the Robbins-Monro conditions (Murphy, 2012):

    Σ_{t=1}^{∞} η_t = ∞,        Σ_{t=1}^{∞} η_t² < ∞.

Some choices for the step size:

    η_t = 1/t,    η_t = 1/(t + b)^m,    η_t = e^{−αt}

(Note that η_t = e^{−αt} has a finite sum, so it does not satisfy the first condition.) If the step size decreases at an appropriate rate, SGD mostly converges to a global optimum (of a convex objective).
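The Robbins-Monro conditions can be illustrated numerically for the schedule η_t = 1/t: the partial sums of η_t grow without bound (like log T), while the partial sums of η_t² stay bounded (approaching π²/6). A small sketch:

```python
# Partial sums for eta_t = 1/t, illustrating the Robbins-Monro conditions:
# sum eta_t diverges while sum eta_t^2 converges.
def partial_sums(T):
    s1 = sum(1.0 / t for t in range(1, T + 1))     # grows like log T
    s2 = sum(1.0 / t**2 for t in range(1, T + 1))  # approaches pi^2/6 = 1.6449...
    return s1, s2

for T in (10, 1000, 100000):
    s1, s2 = partial_sums(T)
    print(T, round(s1, 3), round(s2, 3))
```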
Otherwise it might only reach a local minimum.

Question: Why should the step size decrease over time?
Answer: We need small adjustments near the optimum to avoid overshooting.

Backpropagation

• Chain rule

    ∂f(g(w))/∂w = (∂f/∂g) · (∂g(w)/∂w)

• Some useful gradients

    ∂(x + y)/∂x = 1,    ∂(xy)/∂x = y,    ∂ max(x, y)/∂x = 1 if x ≥ y, 0 if x < y

• Example 1, see Figure 1

    f(x, y, z) = (x + y)z,    ∂f/∂x = z,    ∂f/∂y = z,    ∂f/∂z = x + y

Figure 1: function graph of f(x, y, z)

Given (x, y, z) = (−2, 5, −4), the forward pass gives

    x + y = 3,    (x + y)z = −12.

In the backward pass, we compute the gradients:

    ∂f/∂(x + y) = z = −4
    ∂f/∂z = x + y = 3
    ∂f/∂x = ∂f/∂(x + y) · ∂(x + y)/∂x = −4
    ∂f/∂y = ∂f/∂(x + y) · ∂(x + y)/∂y = −4

• Example 2, see Figure 2

    f(w, x) = 1 / (1 + e^{−(w_1 x_1 + w_2 x_2 + w_3)})

To compute the gradients with respect to w_1, w_2, w_3 we need:

    ∂(a + b)/∂a = 1,    ∂(a·b)/∂a = b,    ∂σ(z)/∂z = σ(z)(1 − σ(z)),    ∂(−x)/∂x = −1

Figure 2: function graph of f(w, x)

• Example 3, see Figure 3

    f(w, x) = max(0, w_1 x_1 + w_2 x_2 + w_3)

Figure 3: function graph of f(w, x)

Gradients we need:

    ∂ max(0, z)/∂z = 1[z ≥ 0],    ∂(a + b)/∂a = 1,    ∂(a·b)/∂a = b

Given (x_1, x_2, w_1, w_2, w_3) = (1, −2, 5, 3, 4), the forward pass gives

    w_1 x_1 = 5,    w_2 x_2 = −6,
    w_1 x_1 + w_2 x_2 = −1,
    w_1 x_1 + w_2 x_2 + w_3 = 3,
    max(0, w_1 x_1 + w_2 x_2 + w_3) = 3.

In the backward pass, writing z = w_1 x_1 + w_2 x_2 + w_3:

    ∂f/∂z = 1[z ≥ 0] = 1,
    ∂f/∂(w_1 x_1) = ∂f/∂z · ∂z/∂(w_1 x_1) = 1,
    ∂f/∂(w_2 x_2) = ∂f/∂z · ∂z/∂(w_2 x_2) = 1,
    ∂f/∂w_3 = ∂f/∂z · ∂z/∂w_3 = 1,
    ∂f/∂w_1 = ∂f/∂(w_1 x_1) · ∂(w_1 x_1)/∂w_1 = x_1 = 1,
    ∂f/∂w_2 = ∂f/∂(w_2 x_2) · ∂(w_2 x_2)/∂w_2 = x_2 = −2.

Multi-layer perceptron

Suppose we have a linear function y = w^T x = Σ_{i=1}^{d} w_i x_i. We can additionally apply a nonlinear function to the transformation:

    y = σ(w^T x) = σ(Σ_{i=1}^{d} w_i x_i)

At each layer s we take the output of the previous layer, y_{s−1}, apply a linear transformation, and then a nonlinearity. The output of unit i is y_{s,i} = σ(w_i^T y_{s−1}).
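This layer-by-layer recursion can be sketched in plain Python. The layer sizes, weights, and input below are hypothetical, chosen only to show the shape of the computation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    # W: k x d matrix (list of rows), v: length-d vector -> length-k vector
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def mlp_forward(x, weights):
    """Apply y_s = sigmoid(W_s y_{s-1}) layer by layer, starting from y_0 = x."""
    y = x
    for W in weights:
        y = [sigmoid(z) for z in matvec(W, y)]
    return y

# Hypothetical 2-layer network: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.5]]   # 2 x 3
W2 = [[1.0, -1.0]]        # 1 x 2
out = mlp_forward([1.0, 2.0, 3.0], [W1, W2])
print(len(out))  # 1 output unit, with value in (0, 1)
```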
This architecture is known as the multi-layer perceptron, see Figure 4.

In matrix form, a layer can be written as σ(W x), where W is a matrix of size k × d and x is a vector of size d × 1.
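Returning to the backpropagation section, the forward and backward passes of Example 3 can be checked with a short script, using the same values as the notes:

```python
# Manual forward and backward pass for Example 3:
# f(w, x) = max(0, w1*x1 + w2*x2 + w3).
x1, x2 = 1.0, -2.0
w1, w2, w3 = 5.0, 3.0, 4.0

# Forward pass, caching the intermediate values.
a = w1 * x1            # 5
b = w2 * x2            # -6
z = a + b + w3         # 3
f = max(0.0, z)        # 3

# Backward pass: apply the chain rule from the output back to the parameters.
df_dz = 1.0 if z >= 0 else 0.0   # derivative of max(0, z)
df_da = df_dz * 1.0              # d(a + b + w3)/da = 1
df_db = df_dz * 1.0
df_dw3 = df_dz * 1.0
df_dw1 = df_da * x1              # d(w1*x1)/dw1 = x1
df_dw2 = df_db * x2              # d(w2*x2)/dw2 = x2

print(f, df_dw1, df_dw2, df_dw3)  # 3.0 1.0 -2.0 1.0
```

The printed gradients match the backward-pass derivation above: ∂f/∂w_1 = 1, ∂f/∂w_2 = −2, ∂f/∂w_3 = 1.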

