CS446: Machine Learning, Fall 2017

Lecture 15: Multi-layer Perceptron and Backpropagation

Lecturer: Sanmi Koyejo                                Scribe: Yihui Cui, Oct. 19th, 2017

Agenda

• Recap of SGD, Perceptron
• Backpropagation
• Multi-Layer Perceptron

Recap

SGD

• Recall
  – Empirical loss: l(w) = (1/n) Σ_{i=1}^{n} l_i(w)
  – Risk: R(f(w), P) = E_P[l(h(w))]
  – Gradient descent update: w_{t+1} = w_t − η ∇_w R(f(w_t), P)
  – Under weak conditions, ∇_w E_P[l(w)] = E_P[∇_w l(w)]
  – so ∇_w l_i(w), for a randomly drawn example i, is an unbiased estimator of the full gradient.

• Algorithm
  One way to optimize a stochastic objective such as E_P[l(h(w))] is to perform the update one example at a time. See Algorithm 1 for pseudocode.

  Algorithm 1 Stochastic Gradient Descent
    Initialize θ, η
    repeat
      Randomly permute the data
      for i = 1 to n do
        g = ∇f(θ, z_i)
        θ ← proj(θ − ηg)
        update η
      end for
    until converged

• Mini-Batch
  SGD has high variance. To reduce the variance we use multiple samples per update, known as mini-batch SGD: we compute the gradient over a mini-batch of k data cases and take the average (see the code sketch at the end of this recap). If k = 1 this is SGD; if k = N this is standard steepest descent.

• Comparison (trade-offs); δ denotes the cost of one gradient evaluation:
  – Gradient descent: computational cost N·δ per update; high memory; generally converges fast.
  – SGD: computational cost δ per update; less likely to get stuck in flat regions.
  – Mini-batch SGD (batch size k): computational cost k·δ per update; less likely to get stuck in flat regions.

• Final parameter estimate (after T steps)
  – Pick the final value w_T, or
  – average all iterates: (1/T) Σ_{t=1}^{T} w_t, or
  – average the last s iterates: (1/s) Σ_{t=T−s+1}^{T} w_t

• Setting the step size
  To guarantee convergence of SGD, there are sufficient conditions on the learning rate known as the Robbins-Monro conditions:

    Σ_{k=1}^{∞} η_k = ∞,        Σ_{k=1}^{∞} η_k² < ∞

  The set of values of η over time is called the learning rate schedule. There are different ways to choose the learning rate:
  – η_k = 1/k
  – η_k = (k + β)^{−m}, where β ≥ 0 slows down the early iterations of the algorithm and m ∈ (0.5, 1] controls the rate at which old values are forgotten
  – η_k = e^{−αk}

  The need to adjust these tuning parameters is one of the main drawbacks of stochastic optimization. One simple heuristic is as follows: store an initial subset of the data and try a range of η values on this subset; then choose the one that results in the fastest decrease in the objective and apply it to the rest of the data. Note that this may not result in convergence, but the algorithm can be terminated when the performance improvement on a hold-out set plateaus (this is called early stopping).
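To make the recap concrete, here is a minimal NumPy sketch of mini-batch SGD with the (k + β)^{−m} schedule above. It is not from the lecture: the function names, the default constants (eta0, beta, m, batch size, epochs), and the squared-loss example are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, eta0=0.5, beta=1.0, m=0.75, k_batch=32, epochs=20, seed=0):
    """Mini-batch SGD sketch.

    grad_fn(w, X_batch, y_batch) returns the average of the per-example
    gradients over the batch, i.e. an unbiased estimate of the full gradient.
    The step size follows eta_t = eta0 * (t + beta)^(-m), which satisfies the
    Robbins-Monro conditions for m in (0.5, 1].
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)               # randomly permute the data
        for start in range(0, n, k_batch):
            idx = perm[start:start + k_batch]   # k_batch = 1 recovers plain SGD
            g = grad_fn(w, X[idx], y[idx])      # mini-batch gradient estimate
            t += 1
            eta = eta0 * (t + beta) ** (-m)     # decaying learning-rate schedule
            w = w - eta * g
    return w

# Illustrative objective: least squares, l_i(w) = 0.5 * (x_i^T w - y_i)^2
def squared_loss_grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=500)
    print(minibatch_sgd(X, y, squared_loss_grad))  # should come out close to w_true
```

Varying k_batch between 1 and n trades gradient variance against per-update cost, matching the comparison above.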
Back Propagation

• Recall the chain rule and a few elementary derivatives:

    ∂f(g(w))/∂w = (∂f/∂g)(∂g/∂w)
    ∂(x + y)/∂x = 1
    ∂(xy)/∂x = y
    ∂max(x, y)/∂x = 1 if x ≥ y, and 0 otherwise

• Formal Analysis
  Let x_n be the n-th input and a_n = V x_n the pre-synaptic hidden layer. Let g be some transfer function and z_n = g(a_n) the post-synaptic hidden layer. Finally, convert the hidden layer to the output layer: b_n = W z_n is the pre-synaptic output layer and ŷ_n = h(b_n) the post-synaptic output layer. The overall pipeline is:

    x_n --V--> a_n --g--> z_n --W--> b_n --h--> ŷ_n

  The parameters of the model are θ = (V, W). Let J(θ) denote the NLL; our goal is to compute ∇_θ J_n.

  For the output layer we have

    ∇_{w_k} J_n = (∂J_n/∂b_{nk}) ∇_{w_k} b_{nk} = (∂J_n/∂b_{nk}) z_n = δ^w_{nk} z_n,

  noting that b_{nk} = w_k^T z_n and defining δ^w_{nk} = ŷ_{nk} − y_{nk}.

  For the input layer we have

    ∇_{v_j} J_n = (∂J_n/∂a_{nj}) ∇_{v_j} a_{nj} = δ^v_{nj} x_n,
    δ^v_{nj} = ∂J_n/∂a_{nj} = Σ_{k=1}^{K} (∂J_n/∂b_{nk}) (∂b_{nk}/∂a_{nj}) = Σ_{k=1}^{K} δ^w_{nk} (∂b_{nk}/∂a_{nj}).

  Since b_{nk} = Σ_j w_{kj} g(a_{nj}), we get

    ∂b_{nk}/∂a_{nj} = w_{kj} g′(a_{nj}).

  Putting everything together,

    δ^v_{nj} = Σ_{k=1}^{K} δ^w_{nk} w_{kj} g′(a_{nj}).

  So the layer-1 errors can be computed by passing the layer-2 errors back through the W matrix. Gradients are computed locally: each node only needs to know about its immediate neighbors.

  In other words, we first perform a forward pass to compute a_n, z_n, b_n, ŷ_n. We then compute the error for the output layer and pass it backwards to compute the error for the hidden layer. Finally, we compute the overall gradient vector by stacking the two component vectors:

    ∇_θ J(θ) = [ Σ_n δ^v_n x_n ; Σ_n δ^w_n z_n ]

• Example
  Let us illustrate back propagation with a simple graph example. Define

    f(w, x) = max(0, w_1 x_1 + w_2 x_2 + w_3).

  Then we have:

    ∂f/∂w_1 = 1[w_1 x_1 + w_2 x_2 + w_3 > 0] · x_1
    ∂f/∂w_2 = 1[w_1 x_1 + w_2 x_2 + w_3 > 0] · x_2
    ∂f/∂w_3 = 1[w_1 x_1 + w_2 x_2 + w_3 > 0] · 1

  For w_1 = 5, w_2 = 3, w_3 = 4, x_1 = 1, x_2 = −2, the forward pass gives w_1 x_1 + w_2 x_2 + w_3 = 3 > 0, so

    ∂f/∂w_1 = 1,    ∂f/∂w_2 = −2,    ∂f/∂w_3 = 1.

  The whole forward/backward propagation is illustrated in the following figure.

  [Figure 1: Forward/Backward pass example]

Multilayer Perceptron

A multilayer perceptron is a series of logistic regression models stacked on top of each other, with the final layer being either another logistic regression model or a linear regression model. For example, if we have two layers and we are solving a regression problem, the model has the form

    p(y | x, θ) = N(y | w^T z(x), σ²),    z(x) = g(Vx) = [g(v_1^T x), ..., g(v_H^T x)].

In binary classification, we pass the output through a sigmoid, as in a GLM:

    p(y | x, θ) = Ber(y | sigm(w^T z(x))),    z(x) = g(Vx) = [g(v_1^T x), ..., g(v_H^T x)].

[Figure 2: Multilayer perceptron]
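To connect the backpropagation derivation with the two-layer model above, here is a minimal NumPy sketch of one forward and backward pass. It is not from the lecture: the function names are made up, and it assumes sigmoid for both the transfer function g and the output h, so that δ^w = ŷ − y as defined above. A short check of the scalar ReLU example follows.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward_backward(x, y, V, W):
    """One forward/backward pass for a single example (x, y).

    Pipeline: x -> a = V x -> z = g(a) -> b = W z -> yhat = h(b),
    with g = h = sigmoid, so that dJ/db = yhat - y for the NLL.
    Returns the gradients of J_n with respect to W and V.
    """
    # Forward pass
    a = V @ x                                # pre-synaptic hidden layer
    z = sigmoid(a)                           # post-synaptic hidden layer
    b = W @ z                                # pre-synaptic output layer
    yhat = sigmoid(b)                        # post-synaptic output layer

    # Backward pass
    delta_w = yhat - y                       # output-layer error delta^w_k
    delta_v = (W.T @ delta_w) * z * (1 - z)  # delta^v_j = sum_k delta^w_k w_kj g'(a_j)
    grad_W = np.outer(delta_w, z)            # dJ/dW_kj = delta^w_k * z_j
    grad_V = np.outer(delta_v, x)            # dJ/dV_jd = delta^v_j * x_d
    return grad_W, grad_V

# Example shapes: D = 3 inputs, H = 4 hidden units, K = 1 output
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
V, W = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(1, 4))
grad_W, grad_V = mlp_forward_backward(x, y, V, W)
print(grad_W.shape, grad_V.shape)            # (1, 4) (4, 3)

# The scalar ReLU example above: f(w, x) = max(0, w1*x1 + w2*x2 + w3)
w1, w2, w3, x1, x2 = 5.0, 3.0, 4.0, 1.0, -2.0
s = w1 * x1 + w2 * x2 + w3                   # forward pass: s = 3 > 0
grads = (x1, x2, 1.0) if s > 0 else (0.0, 0.0, 0.0)
print(grads)                                 # (1.0, -2.0, 1.0), matching the values in the text
```

Each quantity in the backward pass is computed from its immediate neighbors only, which is the locality property noted in the derivation.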

