# CS 446 Machine Learning, University of Illinois at Urbana-Champaign (lecture notes, 6 pages)


CS 446: Machine Learning, Fall 2017. Lecturer: Sanmi Koyejo. Scribe: Yihui Cui. Oct 19th, 2017.

## Lecture 15: Multi-layer Perceptron and Backpropagation

### Agenda

- Recap of SGD
- Perceptron
- Backpropagation
- Multi-Layer Perceptron

### Recap: SGD

Recall the empirical loss

$$l(w) = \frac{1}{n} \sum_{i=1}^{n} l_i(w)$$

and the risk

$$R_f(w; P) = E_P[l(h_w)].$$

Gradient descent performs the update $w_{t+1} = w_t - \eta \nabla_w R_f(w; P)$. Under weak conditions, $\nabla_w E_P[l(w)] = E_P[\nabla_w l(w)]$, so we need an unbiased estimator $\nabla_w l_i(w)$ of the gradient.

**Algorithm.** One way to optimize a stochastic objective such as $E_P[l(h_w)]$ is to perform the update at each step. See Algorithm 1 for pseudocode.

```
Algorithm 1: Stochastic Gradient Descent
Initialize
repeat
    randomly permute the data
    for i = 1 to n do
        g ← ∇f(z_i)
        project g
        update
    end for
until converged
```

**Mini-batch.** SGD has high variance. To reduce the variance we use multiple samples, known as a mini-batch: we compute the gradient of a mini-batch of $k$ data cases, then take the average. If $k = 1$ this is SGD; if $k = N$ this is standard steepest descent.

**Comparison / tradeoff.** Let $\epsilon$ stand for the cost of computing one gradient.

| Method | Computational cost | Notes |
|---|---|---|
| Gradient descent | $N\epsilon$ | high memory; generally converges fast |
| SGD | $\epsilon$ | less likely to get stuck in flat regions |
| Mini-batch SGD (size $k$) | $k\epsilon$ | less likely to get stuck in flat regions |

**Finding the parameter estimate.** After $T$ steps, either pick the final value $w_T$, or average the iterates:

$$\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w_t \qquad \text{or} \qquad \bar{w} = \frac{1}{s} \sum_{t=T-s}^{T} w_t.$$

**Setting the step size.** To guarantee convergence of SGD, there are sufficient conditions on the learning rate $\eta_k$, known as the Robbins-Monro conditions:

$$\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.$$

**Choice of step size.** The set of values $\{\eta_k\}$ over time is called the learning rate schedule. There are different ways to choose it, for example $\eta_k = (\tau_0 + k)^{-m}$, where $\tau_0 \geq 0$ slows down the early iterations of the algorithm and $m \in (0.5, 1]$ controls the rate at which old values are forgotten, or exponential decay $\eta_k \propto e^{-t}$.

The need to adjust these tuning parameters is one of the main drawbacks of stochastic optimization. One simple heuristic is as follows: store an initial subset of the data and try a range of values on this subset, then choose the one that results in the fastest decrease in the objective and apply it to the rest of the data. Note that this may not result in convergence, but the algorithm can be terminated when the performance improvement on a hold-out set plateaus (this is called early stopping).

### Backpropagation

Recall the chain rule:

$$\frac{\partial f(g(w))}{\partial w} = f'(g(w)) \, g'(w).$$

Some simple examples of partial derivatives:

$$\frac{\partial (x + y)}{\partial x} = 1, \qquad \frac{\partial (xy)}{\partial x} = y, \qquad \frac{\partial \max(x, y)}{\partial x} = \begin{cases} 1 & \text{if } x \geq y \\ 0 & \text{otherwise.} \end{cases}$$

**Formal analysis.** Let $x_n$ be the $n$-th input, let $a_n = V x_n$ be the pre-synaptic hidden layer, let $g$ be some transfer function, and let $z_n = g(a_n)$ be the post-synaptic hidden layer. At last, let us convert
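The simple partial-derivative examples above ($\partial(x+y)/\partial x$, $\partial(xy)/\partial x$, $\partial\max(x,y)/\partial x$) can be checked numerically. The sketch below is not from the lecture; the function names are my own, and each stated derivative is compared against a central finite difference.

```python
# Numeric check of the chain-rule examples: d(x+y)/dx, d(xy)/dx, d(max(x,y))/dx.
# All names here are illustrative, not from the lecture notes.

def finite_diff(f, x, y, h=1e-6):
    """Approximate df/dx at (x, y) with a central difference."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def grad_add(x, y):
    """d(x + y)/dx = 1."""
    return 1.0

def grad_mul(x, y):
    """d(x * y)/dx = y."""
    return y

def grad_max(x, y):
    """d(max(x, y))/dx = 1 if x >= y, else 0 (away from the tie x == y)."""
    return 1.0 if x >= y else 0.0

if __name__ == "__main__":
    x, y = 3.0, -2.0
    assert abs(grad_add(x, y) - finite_diff(lambda a, b: a + b, x, y)) < 1e-4
    assert abs(grad_mul(x, y) - finite_diff(lambda a, b: a * b, x, y)) < 1e-4
    assert abs(grad_max(x, y) - finite_diff(lambda a, b: max(a, b), x, y)) < 1e-4
    print("all chain-rule examples match finite differences")
```

Note that $\max(x, y)$ is not differentiable at the tie $x = y$; the finite-difference check is only valid away from that point.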

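The setup in the formal analysis (pre-synaptic $a_n = V x_n$, post-synaptic $z_n = g(a_n)$) can be sketched together with a mini-batch SGD step from the recap. This is a minimal sketch under my own assumptions, not the lecture's derivation: the preview cuts off before the analysis is completed, so the sigmoid transfer function, the squared loss, the output weight matrix `W`, and the learning rate are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(V, W, X):
    """Forward pass for one hidden layer: a = x V^T, z = g(a), then a linear output."""
    A = X @ V.T           # pre-synaptic hidden layer, a_n = V x_n
    Z = sigmoid(A)        # post-synaptic hidden layer, z_n = g(a_n)
    Y_hat = Z @ W.T       # linear output layer (illustrative assumption)
    return A, Z, Y_hat

def minibatch_sgd_step(V, W, X, Y, lr=0.1):
    """One mini-batch SGD step on squared loss; gradients come from the chain rule."""
    k = X.shape[0]                 # mini-batch size
    _, Z, Y_hat = forward(V, W, X)
    dY = (Y_hat - Y) / k           # d(loss)/d(y_hat), averaged over the mini-batch
    dW = dY.T @ Z                  # chain rule through the output layer
    dZ = dY @ W                    # backpropagate into the hidden activations
    dA = dZ * Z * (1.0 - Z)        # sigmoid': g'(a) = g(a)(1 - g(a))
    dV = dA.T @ X                  # chain rule through the hidden layer
    return V - lr * dV, W - lr * dW

# toy usage: repeated steps on one fixed mini-batch should reduce the loss
X = rng.normal(size=(8, 3))
Y = rng.normal(size=(8, 1))
V = rng.normal(scale=0.5, size=(4, 3))   # hidden-layer weights
W = rng.normal(scale=0.5, size=(1, 4))   # output weights
loss0 = float(np.mean((forward(V, W, X)[2] - Y) ** 2))
for _ in range(200):
    V, W = minibatch_sgd_step(V, W, X, Y)
loss1 = float(np.mean((forward(V, W, X)[2] - Y) ** 2))
```

Taking the mini-batch of size $k = 1$ recovers plain SGD, and $k = N$ recovers standard steepest descent, matching the tradeoff table in the recap.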