# ILLINOIS CS 446 - 101917.2 (6 pages)

School: University of Illinois - Urbana
Course: CS 446 - Machine Learning
CS 446 Machine Learning, Fall 2017

## Lecture 15: Multi-layer Perceptron and Backpropagation

Lecturer: Sanmi Koyejo. Scribe: Yihui Cui. Oct 19th, 2017.

**Agenda**

- Recap of SGD
- Perceptron
- Backpropagation
- Multi-Layer Perceptron

## Recap: SGD

Recall the empirical loss

$$l(w) = \frac{1}{n} \sum_{i=1}^{n} l_i(w) \approx R_f(w, P) = \mathbb{E}_P[l(h_w)].$$

Gradient descent performs the update

$$w_{t+1} = w_t - \eta \, \nabla_w R_f(w, P).$$

Under weak conditions, $\nabla \mathbb{E}_P[l(w)] = \mathbb{E}_P[\nabla_w l(w)]$, so what we need is an unbiased estimator $\nabla_w l_i(w)$ of the gradient.

One way to optimize a stochastic objective such as $\mathbb{E}_P[l(h_w)]$ is to perform the update one sample at a time; see Algorithm 1 for pseudocode.

**Algorithm 1: Stochastic Gradient Descent**

```
Initialize w
repeat
    Randomly permute the data
    for i = 1 to n do
        g ← ∇f(z_i)
        g ← proj(g)
        update w
    end for
until converged
```

**Mini-batch SGD.** SGD has high variance. To reduce the variance we use multiple samples, also known as a mini-batch: we compute the gradient on a mini-batch of $k$ data cases, then take the average. If $k = 1$ this is SGD; if $k = N$ this is standard steepest descent.

**Comparison / tradeoff** (here $\theta$ stands for the cost of one gradient evaluation):

- Gradient descent: computational cost $N\theta$ per step; high memory; generally converges fast.
- SGD: computational cost $\theta$ per step; less likely to get stuck in flat regions.
- Mini-batch SGD with batch size $k$: computational cost $k\theta$ per step; less likely to get stuck in flat regions.

**Finding the parameter estimate.** After $T$ steps we can pick the final value $w_T$, or average the iterates:

$$\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w_t \qquad \text{or} \qquad \bar{w} = \frac{1}{s} \sum_{t=T-s}^{T} w_t.$$

**Setting the step size.** To guarantee convergence of SGD there are sufficient conditions on the learning rate, known as the Robbins-Monro conditions:

$$\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.$$

**Choice of step size.** The set of values $\{\eta_k\}$ over time is called the learning rate schedule. There are different ways to choose it, for example

$$\eta_k = \frac{1}{k} \qquad \text{or} \qquad \eta_k = (\tau_0 + k)^{-\kappa},$$

where $\tau_0 \geq 0$ slows down early iterations of the algorithm and $\kappa \in (0.5, 1]$ controls the rate at which old values are forgotten.

The need to adjust these tuning parameters is one of the main drawbacks of stochastic optimization. One simple heuristic is as follows: store an initial subset of the data and try a range of values on this
subset, then choose the one that results in the fastest decrease in the objective and apply it to all the rest of the data. Note that this may not result in convergence, but the algorithm can be terminated when the performance improvement on a hold-out set plateaus; this is called early stopping.

## Backpropagation

**Recall the chain rule:**

$$\frac{\partial}{\partial w} f(g(w)) = f'(g(w)) \, g'(w).$$

For example,

$$\frac{\partial (x + y)}{\partial x} = 1, \qquad \frac{\partial (xy)}{\partial x} = y, \qquad \frac{\partial (xy)}{\partial y} = x, \qquad \frac{\partial \max(x, y)}{\partial x} = \begin{cases} 1 & \text{if } x \geq y \\ 0 & \text{otherwise.} \end{cases}$$

**Formal analysis.** Let us define $x_n$ as the $n$-th input, and $a_n = V x_n$ as the pre-synaptic hidden layer. With $g$ some transfer function, $z_n = g(a_n)$ is the post-synaptic hidden layer. Finally, converting the hidden layer to the output layer, $b_n = W z_n$ is the pre-synaptic output layer and $\hat{y}_n = h(b_n)$ is the post-synaptic output layer. The overall pipeline is

$$x_n \xrightarrow{V} a_n \xrightarrow{g} z_n \xrightarrow{W} b_n \xrightarrow{h} \hat{y}_n.$$

The parameters of the model are $V$ and $W$. Let $J$ denote the NLL; our goal is to compute $\nabla J_n$.

For the output layer we have

$$\nabla_{w_k} J_n = \frac{\partial J_n}{\partial b_{nk}} \, \nabla_{w_k} b_{nk} = \delta^w_{nk} \, z_n,$$

noticing that $b_{nk} = w_k^T z_n$ and defining $\delta^w_{nk} \triangleq \frac{\partial J_n}{\partial b_{nk}} = \hat{y}_{nk} - y_{nk}$.

For the input layer we have

$$\nabla_{v_j} J_n = \frac{\partial J_n}{\partial a_{nj}} \, \nabla_{v_j} a_{nj} = \delta^v_{nj} \, x_n.$$

Since $b_{nk} = \sum_j w_{kj} \, g(a_{nj})$, we get $\frac{\partial b_{nk}}{\partial a_{nj}} = w_{kj} \, g'(a_{nj})$. Putting everything together,

$$\delta^v_{nj} = \sum_{k=1}^{K} \frac{\partial J_n}{\partial b_{nk}} \frac{\partial b_{nk}}{\partial a_{nj}} = \sum_{k=1}^{K} \delta^w_{nk} \, w_{kj} \, g'(a_{nj}).$$

So the layer-1 errors can be computed by passing the layer-2 errors back through the $W$ matrix. We compute gradients locally: each node only needs to know about its immediate neighbors. In other words, we first perform a forward pass to compute $a_n, z_n, b_n, \hat{y}_n$; then we compute the error for the output layer and pass it backwards to compute the error for the hidden layer. Finally, we compute the overall gradient vector by stacking the two component vectors:

$$\nabla J = \left[ \sum_n \delta^v_n \, x_n, \;\; \sum_n \delta^w_n \, z_n \right].$$

Let us illustrate backpropagation with a simple graph example. Define

$$f(w, x) = \max(0, \; w_1 x_1 + w_2 x_2 + w_3).$$

Then

$$\frac{\partial f}{\partial w_1} = \mathbb{1}[w_1 x_1 + w_2 x_2 + w_3 > 0] \, x_1, \quad \frac{\partial f}{\partial w_2} = \mathbb{1}[w_1 x_1 + w_2 x_2 + w_3 > 0] \, x_2, \quad \frac{\partial f}{\partial w_3} = \mathbb{1}[w_1 x_1 + w_2 x_2 + w_3 > 0] \cdot 1.$$

For $w_1 = 5$, $w_2 = 3$, $w_3 = 4$ and $x_1 = 1$, $x_2 = 2$, the pre-activation is positive, so

$$\frac{\partial f}{\partial w_1} = 1, \qquad \frac{\partial f}{\partial w_2} = 2, \qquad \frac{\partial f}{\partial w_3} = 1.$$
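The worked ReLU example can be checked numerically. The following sketch (illustrative code, not part of the original notes) implements the forward pass and the analytic gradient obtained from the chain rule, then compares it against central finite differences:

```python
# Backprop example from the notes: f(w, x) = max(0, w1*x1 + w2*x2 + w3).

def f(w, x):
    # Forward pass: affine pre-activation followed by a ReLU.
    return max(0.0, w[0] * x[0] + w[1] * x[1] + w[2])

def grad_f(w, x):
    # Backward pass: the max(0, .) gate passes the gradient through
    # only when the pre-activation is positive (chain rule).
    pre = w[0] * x[0] + w[1] * x[1] + w[2]
    gate = 1.0 if pre > 0 else 0.0
    return [gate * x[0], gate * x[1], gate * 1.0]

def numerical_grad(w, x, eps=1e-6):
    # Central finite differences, one coordinate of w at a time.
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((f(wp, x) - f(wm, x)) / (2 * eps))
    return g

w, x = [5.0, 3.0, 4.0], [1.0, 2.0]
print(grad_f(w, x))          # [1.0, 2.0, 1.0], matching the notes
print(numerical_grad(w, x))  # approximately the same
```

Since the pre-activation is positive at this point, the ReLU gate is open and the analytic and numerical gradients agree.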
The whole forward-backward propagation is illustrated in the following figure.

Figure 1: Forward/backward pass example.

## Multilayer Perceptron

A multilayer perceptron is a series of logistic regression models stacked on top of each other, with the final layer being either another logistic regression or a linear model. For example, if we have two layers and we are solving a regression problem, the model has the form

$$p(y \mid x, w) = \mathcal{N}\left(y \mid w^T z(x), \, \sigma^2\right), \qquad z(x) = g(Vx) = \left[ g(v_1^T x), \ldots, g(v_H^T x) \right].$$

In binary classification we pass the output through a sigmoid, as in a GLM:

$$p(y \mid x, w) = \mathrm{Ber}\left(y \mid \mathrm{sigm}(w^T z(x))\right), \qquad z(x) = g(Vx) = \left[ g(v_1^T x), \ldots, g(v_H^T x) \right].$$

Figure 2: Multilayer perceptron.
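As a concrete illustration of the two-layer model, here is a minimal forward-pass sketch assuming NumPy, with $g = \tanh$ as one possible choice of transfer function; the shapes (`V` of size H by D, `w` of length H) and all numbers are illustrative assumptions, not from the notes:

```python
import numpy as np

def mlp_forward(x, V, w, task="regression"):
    # Hidden layer: z(x) = g(Vx), applied elementwise with g = tanh.
    z = np.tanh(V @ x)
    # Final layer: linear score w^T z(x).
    s = w @ z
    if task == "regression":
        return s                          # mean of N(y | w^T z(x), sigma^2)
    return 1.0 / (1.0 + np.exp(-s))       # sigm(w^T z(x)) for Ber(y | ...)

rng = np.random.default_rng(0)
D, H = 3, 4                               # input dimension, hidden units
V = rng.standard_normal((H, D))
w = rng.standard_normal(H)
x = rng.standard_normal(D)
print(mlp_forward(x, V, w))                 # a real-valued regression mean
print(mlp_forward(x, V, w, task="binary"))  # a probability in (0, 1)
```

Swapping the final layer between a linear output and a sigmoid is the only difference between the regression and binary-classification variants above.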

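Tying the SGD recap together: Algorithm 1 with mini-batches of size $k$ and the polynomial schedule $\eta_t = (\tau_0 + t)^{-\kappa}$ (which satisfies the Robbins-Monro conditions for $\kappa \in (0.5, 1]$) can be sketched on a least-squares objective. All data and hyperparameter values below are illustrative, not from the notes:

```python
import numpy as np

def minibatch_sgd(X, y, k=10, tau0=10.0, kappa=0.75, epochs=50, seed=0):
    # Mini-batch SGD for least squares, l_i(w) = 0.5 * (x_i^T w - y_i)^2.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)           # randomly permute the data
        for start in range(0, n, k):
            idx = perm[start:start + k]     # mini-batch of k data cases
            # Averaged mini-batch gradient: an unbiased estimate of the
            # full-data gradient.
            g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            t += 1
            eta = (tau0 + t) ** (-kappa)    # decaying learning-rate schedule
            w -= eta * g
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                              # noiseless labels for illustration
print(minibatch_sgd(X, y))                  # approaches w_true
```

Setting `k=1` recovers plain SGD and `k=len(X)` recovers full-batch steepest descent, matching the tradeoff discussed above.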