# ILLINOIS CS 446 - 101917.1 (9 pages)

Previewing pages*1, 2, 3*of 9 page document

**View the full content.**## 101917.1

Previewing pages
*1, 2, 3*
of
actual document.

**View the full content.**View Full Document

## 101917.1

0 0 94 views

- Pages:
- 9
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 15 Backpropagation and Multi layer Perceptron Lecturer Sanmi Koyejo Scribe Zhenfeng Chen Oct 19th 2017 Agenda Recap SGD Perceptron Backpropagation Multi layer Perceptron Recap Stochastic Gradient Descent SGD The risk function can usually be approximated by the empirical risk which equals to the average of the loss of samples n l w 1X li w n i 1 R f w p Ep l w where w is the model parameter f w is a classifier wt 1 wt Ow R p Therefore we can get a good estimator of Ow R p for gradient descent by computing the empirical average of the gradient Some properties a Under weak condition we have Ow Ep l w Ep Ow l w It means for most of the loss functions we noticed that the gradient of the expectation of loss functions equals to the expectation of the gradient of the loss function 1 2 15 Backpropagation and Multi layer Perceptron b SGD needs an unbiased estimator Ow li w The gradient of one sample is an unbiased estimator of the gradient so that to solve the problem in a we only need an unbiased estimator of gradient instead of the true gradient That is why in SGD the true gradient of the loss function can be approximated by the gradient of one single example Question How to prove an estimator is unbiased Prove 0 Suppose x p 1 We sample x1 x2 xn from the population to get an estimator T x1 x2 xn 2 When T x1 x2 xn xi i 0 n we have E xi E x x1 3 Therefore we have n E T 1X E xi n i 1 n 1X E x n i 1 By far we prove that the estimator we generate from a dataset is unbiased because E T E x 0 c SGD has high variance Question How to reduce variance To reduce variance we can sample data randomly from dataset Mini Batch SGD multiple samples SGD Algorithm 1 Mini batch SGD Dn x1 xn randomly shuffle data for j in range epochs do P k 1 j g w k1 i kj Ow li w wj 1 wj j g w end for Question Why do we need to randomly shuffle data Answer It avoids correlation in sample order For instance real data are not i i d independent and identically distributed Correlated data are usually store together 15 Backpropagation and Multi layer Perceptron 3 Bias of Mini Batch Suppose T x1 x2 xn 1 n Pn i 1 xi we have Bias T E T E x n 1X E xi E x n i 1 n 1X E xi E x n i 1 n 1X xi E x n i 1 0 What is the tradeoff when we use SGD and Mini batch SGD instead of gradient descent To compare with gradient descent SGD and mini batch SGD we have Gradient Descent To compute gradient for each iteration the computational cost is N while is the cost of computing gradient for a single sample Pro Gradient descent also known as steepest descent generally converges faster than SGD and mini batch SGD You can think about GD can always find the shortest path to the global optimal if it doesn t get stuck at the local optimal Con It requires higher memory to store gradients Gradient descent has higher computational cost The computation cost of GD is expensive when the dataset is huge SGD The computational cost of SGD is Pro SGD is less likely to get stuck at flat regions because it has certain amount of noise Murphy 2012 to help it get rid of the local optimum Con SGD has slower convergence rate in general Mini batch SGD The computational cost of mini batch SGD is k k is the size of samples Pro Mini batch SGD is less likely to get stuck at flat regions 4 15 Backpropagation and Multi layer Perceptron Subway How to estimate the final parameter 1 Pick the last value wt which may not exactly be the global optimal P 2 T1 Tt 1 wt Average the entire parameter sequence If T is large enough the first few steps will end up convergence quickly P 3 Choose a local average e g 1s st T s wt How to set the step size for SGD A proper step size should have the properties also known as the Robbin Monro conditions Murphy 2012 that X t t 1 X t2 t 1 Some choices for step size 1 t t t 1 t b m t e t If the step size is decreasing with an appropriate rate SGD mostly converges to a global optimum Otherwise it might converge to a local minimum Question Why the step size is decreasing Answer We need small adjustments near optimal point to avoid overshooting Backpropagation Chain rule f g w f g w w g w Some gradients x y 1 x xy y x max x y x Example 1 see Figure 1 f z x f x y z x y z f f z x y y z 1 x y 0 x y 15 Backpropagation and Multi layer Perceptron 5 Figure 1 function graph of f x y z Given x y z 2 5 4 in forward pass we have x y 3 x y z 12 In backward pass we can compute the gradients f x y f z f x f y z 4 x y 3 f x y 4 x y x f x y 4 x y y Example 2 see Figure 2 f w x 1 1 e w1 x1 w2 x2 w3 To compute w1 w2 w3 gradients we need a b b a a b 1 a z z 1 z z Example 3 see Figure 3 f w x max 0 w1 x1 w2 x2 w3 x 1 x 6 15 Backpropagation and Multi layer Perceptron Figure 2 function graph of f w x Figure 3 function graph of f w x Gradients we need max 0 z 1 z 0 z a b 1 a a b b a Given x1 x2 w1 w2 w3 1 2 5 3 4 in forward pass we have w1 x1 5 w2 x2 6 w1 x1 w2 x2 1 w1 x1 w2 x2 w3 3 max 0 w1 x1 w2 x2 w3 3 15 Backpropagation and Multi layer Perceptron 7 In backward pass f w1 x1 w2 x2 w3 f w1 x1 f w2 x2 f w3 f w1 f w2 1 w1 x1 w2 x2 w3 0 1 f w1 x1 w2 x2 w3 w1 x1 w2 x2 w3 w1 x1 f w1 x1 w2 x2 w3 w1 x1 w2 x2 w3 w2 x2 f w1 x1 w2 x2 w3 w1 x1 w2 x2 w3 w3 f w1 x1 w2 x2 w3 w1 x1 w1 x1 w2 x2 w3 w1 x1 w1 f w1 x1 w2 x2 w3 w2 x2 w1 x1 w2 x2 w3 w2 x2 w2 1 1 1 1 2 Multi layer perceptron Suppose …

View Full Document