CS446: Machine Learning, Fall 2017

Lecture 15: Multi-layer Perceptron and Backpropagation

Lecturer: Sanmi Koyejo                                Scribe: Yihui Cui, Oct. 19th, 2017

Agenda

• Recap of SGD, Perceptron
• Backpropagation
• Multi-Layer Perceptron

Recap

SGD

• Recall
  – Empirical loss: l(w) = (1/n) Σ_{i=1}^{n} l_i(w)
  – Risk: R(f(w), P) = E_P[l(h(w))]
  – Gradient descent update: w_{t+1} = w_t − η ∇_w R(f(w_t), P)
  – Under weak conditions, ∇_w E_P[l(w)] = E_P[∇_w l(w)]
  – so ∇_w l_i(w), for a randomly drawn example i, is an unbiased estimator of the full gradient.

• Algorithm
  One way to optimize a stochastic objective such as E_P[l(h(w))] is to perform the update one example at a time. See Algorithm 1 for pseudocode.

  Algorithm 1 Stochastic Gradient Descent
    Initialize θ, η
    repeat
      Randomly permute the data
      for i = 1 to n do
        g = ∇f(θ, z_i)
        θ ← proj(θ − ηg)
        update η
      end for
    until converged

• Mini-Batch
  SGD has high variance. To reduce the variance we use multiple samples per update, known as mini-batch SGD: we compute the gradient over a mini-batch of k data cases and take the average (see the code sketch at the end of this recap). If k = 1 this is SGD; if k = N this is standard steepest descent.

• Comparison (trade-offs); δ denotes the cost of one gradient evaluation:
  – Gradient descent: computational cost N·δ per update; high memory; generally converges fast.
  – SGD: computational cost δ per update; less likely to get stuck in flat regions.
  – Mini-batch SGD (batch size k): computational cost k·δ per update; less likely to get stuck in flat regions.

• Final parameter estimate (after T steps)
  – Pick the final value w_T, or
  – average all iterates: (1/T) Σ_{t=1}^{T} w_t, or
  – average the last s iterates: (1/s) Σ_{t=T−s+1}^{T} w_t

• Setting the step size
  To guarantee convergence of SGD, there are sufficient conditions on the learning rate known as the Robbins-Monro conditions:

    Σ_{k=1}^{∞} η_k = ∞,        Σ_{k=1}^{∞} η_k² < ∞

  The set of values of η over time is called the learning rate schedule. There are different ways to choose the learning rate:
  – η_k = 1/k
  – η_k = (k + β)^{−m}, where β ≥ 0 slows down the early iterations of the algorithm and m ∈ (0.5, 1] controls the rate at which old values are forgotten
  – η_k = e^{−αk}

  The need to adjust these tuning parameters is one of the main drawbacks of stochastic optimization. One simple heuristic is as follows: store an initial subset of the data and try a range of η values on this subset; then choose the one that results in the fastest decrease in the objective and apply it to the rest of the data. Note that this may not result in convergence, but the algorithm can be terminated when the performance improvement on a hold-out set plateaus (this is called early stopping).
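To make the recap concrete, here is a minimal NumPy sketch of mini-batch SGD with the (k + β)^{−m} schedule above. It is not from the lecture: the function names, the default constants (eta0, beta, m, batch size, epochs), and the squared-loss example are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, eta0=0.5, beta=1.0, m=0.75, k_batch=32, epochs=20, seed=0):
    """Mini-batch SGD sketch.

    grad_fn(w, X_batch, y_batch) returns the average of the per-example
    gradients over the batch, i.e. an unbiased estimate of the full gradient.
    The step size follows eta_t = eta0 * (t + beta)^(-m), which satisfies the
    Robbins-Monro conditions for m in (0.5, 1].
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)               # randomly permute the data
        for start in range(0, n, k_batch):
            idx = perm[start:start + k_batch]   # k_batch = 1 recovers plain SGD
            g = grad_fn(w, X[idx], y[idx])      # mini-batch gradient estimate
            t += 1
            eta = eta0 * (t + beta) ** (-m)     # decaying learning-rate schedule
            w = w - eta * g
    return w

# Illustrative objective: least squares, l_i(w) = 0.5 * (x_i^T w - y_i)^2
def squared_loss_grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=500)
    print(minibatch_sgd(X, y, squared_loss_grad))  # should come out close to w_true
```

Varying k_batch between 1 and n trades gradient variance against per-update cost, matching the comparison above.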
Back Propagation

• Recall the chain rule and a few elementary derivatives:

    ∂f(g(w))/∂w = (∂f/∂g)(∂g/∂w)
    ∂(x + y)/∂x = 1
    ∂(xy)/∂x = y
    ∂max(x, y)/∂x = 1 if x ≥ y, and 0 otherwise

• Formal Analysis
  Let x_n be the n-th input and a_n = V x_n the pre-synaptic hidden layer. Let g be some transfer function and z_n = g(a_n) the post-synaptic hidden layer. Finally, convert the hidden layer to the output layer: b_n = W z_n is the pre-synaptic output layer and ŷ_n = h(b_n) the post-synaptic output layer. The overall pipeline is:

    x_n --V--> a_n --g--> z_n --W--> b_n --h--> ŷ_n

  The parameters of the model are θ = (V, W). Let J(θ) denote the NLL; our goal is to compute ∇_θ J_n.

  For the output layer we have

    ∇_{w_k} J_n = (∂J_n/∂b_{nk}) ∇_{w_k} b_{nk} = (∂J_n/∂b_{nk}) z_n = δ^w_{nk} z_n,

  noting that b_{nk} = w_k^T z_n and defining δ^w_{nk} = ŷ_{nk} − y_{nk}.

  For the input layer we have

    ∇_{v_j} J_n = (∂J_n/∂a_{nj}) ∇_{v_j} a_{nj} = δ^v_{nj} x_n,
    δ^v_{nj} = ∂J_n/∂a_{nj} = Σ_{k=1}^{K} (∂J_n/∂b_{nk}) (∂b_{nk}/∂a_{nj}) = Σ_{k=1}^{K} δ^w_{nk} (∂b_{nk}/∂a_{nj}).

  Since b_{nk} = Σ_j w_{kj} g(a_{nj}), we get

    ∂b_{nk}/∂a_{nj} = w_{kj} g′(a_{nj}).

  Putting everything together,

    δ^v_{nj} = Σ_{k=1}^{K} δ^w_{nk} w_{kj} g′(a_{nj}).

  So the layer-1 errors can be computed by passing the layer-2 errors back through the W matrix. Gradients are computed locally: each node only needs to know about its immediate neighbors.

  In other words, we first perform a forward pass to compute a_n, z_n, b_n, ŷ_n. We then compute the error for the output layer and pass it backwards to compute the error for the hidden layer. Finally, we compute the overall gradient vector by stacking the two component vectors:

    ∇_θ J(θ) = [ Σ_n δ^v_n x_n ; Σ_n δ^w_n z_n ]

• Example
  Let us illustrate back propagation with a simple graph example. Define

    f(w, x) = max(0, w_1 x_1 + w_2 x_2 + w_3).

  Then we have:

    ∂f/∂w_1 = 1[w_1 x_1 + w_2 x_2 + w_3 > 0] · x_1
    ∂f/∂w_2 = 1[w_1 x_1 + w_2 x_2 + w_3 > 0] · x_2
    ∂f/∂w_3 = 1[w_1 x_1 + w_2 x_2 + w_3 > 0] · 1

  For w_1 = 5, w_2 = 3, w_3 = 4, x_1 = 1, x_2 = −2, the forward pass gives w_1 x_1 + w_2 x_2 + w_3 = 3 > 0, so

    ∂f/∂w_1 = 1,    ∂f/∂w_2 = −2,    ∂f/∂w_3 = 1.

  The whole forward/backward propagation is illustrated in the following figure.

  [Figure 1: Forward/Backward pass example]

Multilayer Perceptron

A multilayer perceptron is a series of logistic regression models stacked on top of each other, with the final layer being either another logistic regression model or a linear regression model. For example, if we have two layers and we are solving a regression problem, the model has the form

    p(y | x, θ) = N(y | w^T z(x), σ²),    z(x) = g(Vx) = [g(v_1^T x), ..., g(v_H^T x)].

In binary classification, we pass the output through a sigmoid, as in a GLM:

    p(y | x, θ) = Ber(y | sigm(w^T z(x))),    z(x) = g(Vx) = [g(v_1^T x), ..., g(v_H^T x)].

[Figure 2: Multilayer perceptron]
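To connect the backpropagation derivation with the two-layer model above, here is a minimal NumPy sketch of one forward and backward pass. It is not from the lecture: the function names are made up, and it assumes sigmoid for both the transfer function g and the output h, so that δ^w = ŷ − y as defined above. A short check of the scalar ReLU example follows.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward_backward(x, y, V, W):
    """One forward/backward pass for a single example (x, y).

    Pipeline: x -> a = V x -> z = g(a) -> b = W z -> yhat = h(b),
    with g = h = sigmoid, so that dJ/db = yhat - y for the NLL.
    Returns the gradients of J_n with respect to W and V.
    """
    # Forward pass
    a = V @ x                                # pre-synaptic hidden layer
    z = sigmoid(a)                           # post-synaptic hidden layer
    b = W @ z                                # pre-synaptic output layer
    yhat = sigmoid(b)                        # post-synaptic output layer

    # Backward pass
    delta_w = yhat - y                       # output-layer error delta^w_k
    delta_v = (W.T @ delta_w) * z * (1 - z)  # delta^v_j = sum_k delta^w_k w_kj g'(a_j)
    grad_W = np.outer(delta_w, z)            # dJ/dW_kj = delta^w_k * z_j
    grad_V = np.outer(delta_v, x)            # dJ/dV_jd = delta^v_j * x_d
    return grad_W, grad_V

# Example shapes: D = 3 inputs, H = 4 hidden units, K = 1 output
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
V, W = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(1, 4))
grad_W, grad_V = mlp_forward_backward(x, y, V, W)
print(grad_W.shape, grad_V.shape)            # (1, 4) (4, 3)

# The scalar ReLU example above: f(w, x) = max(0, w1*x1 + w2*x2 + w3)
w1, w2, w3, x1, x2 = 5.0, 3.0, 4.0, 1.0, -2.0
s = w1 * x1 + w2 * x2 + w3                   # forward pass: s = 3 > 0
grads = (x1, x2, 1.0) if s > 0 else (0.0, 0.0, 0.0)
print(grads)                                 # (1.0, -2.0, 1.0), matching the values in the text
```

Each quantity in the backward pass is computed from its immediate neighbors only, which is the locality property noted in the derivation.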

