# ILLINOIS CS 446 - 101917.1 (9 pages)

- Pages: 9
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


## CS 446 Machine Learning, Fall 2017

### Lecture 15: Backpropagation and Multi-layer Perceptron

*Lecturer: Sanmi Koyejo. Scribe: Zhenfeng Chen. Oct 19th, 2017.*

### Agenda

- Recap: SGD
- Perceptron
- Backpropagation
- Multi-layer Perceptron

### Recap: Stochastic Gradient Descent (SGD)

The risk function can usually be approximated by the empirical risk, which equals the average of the per-sample losses:

$$\hat{l}(w) = \frac{1}{n} \sum_{i=1}^{n} l_i(w) \approx R(f_w, p) = \mathbb{E}_p[l(w)],$$

where $w$ is the model parameter and $f_w$ is a classifier. Gradient descent updates

$$w_{t+1} = w_t - \eta_t \, \nabla_w R(f_w, p).$$

Therefore we can get a good estimator of $\nabla_w R(f_w, p)$ for gradient descent by computing the empirical average of the gradient.

Some properties:

(a) Under weak conditions, $\nabla_w \mathbb{E}_p[l(w)] = \mathbb{E}_p[\nabla_w l(w)]$. That is, for most loss functions, the gradient of the expected loss equals the expected value of the gradient of the loss.

(b) SGD needs an unbiased estimator of $\nabla_w l_i(w)$. The gradient of one sample is an unbiased estimator of the true gradient, so given (a), we only need an unbiased estimator of the gradient instead of the true gradient itself. That is why, in SGD, the true gradient of the loss function can be approximated by the gradient of one single example.

**Question:** How do we prove that an estimator is unbiased?

**Proof:**

1. Suppose $x \sim p$. We sample $x_1, x_2, \dots, x_n$ from the population to build an estimator $T(x_1, x_2, \dots, x_n)$.
2. When $T(x_1, x_2, \dots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$, each sample satisfies $\mathbb{E}[x_i] = \mathbb{E}[x]$.
3. Therefore

$$\mathbb{E}[T] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[x_i] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[x] = \mathbb{E}[x].$$

Since $\mathbb{E}[T] - \mathbb{E}[x] = 0$, the estimator we generate from the dataset is unbiased.

(c) SGD has high variance.

**Question:** How do we reduce the variance?

To reduce variance, we can sample several data points at random from the dataset: mini-batch SGD (SGD with multiple samples).

**Algorithm 1: Mini-batch SGD**

- $D_n = \{x_1, \dots, x_n\}$; randomly shuffle the data.
- For $j$ in range(epochs):
  - $g(w) = \frac{1}{k} \sum_{i \in k_j} \nabla_w l_i(w)$, where $k_j$ is a batch of size $k$
  - $w_{j+1} = w_j - \eta_j \, g(w)$

**Question:** Why do we need to randomly shuffle the data?

**Answer:** It avoids correlation in the sample order. For instance, real data are often not i.i.d. (independent and identically distributed).
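Property (b) can be checked numerically. The sketch below (my own illustration, not from the notes; the data and the squared loss are assumed for concreteness) averages many single-sample gradients drawn uniformly at random and compares the result to the full empirical gradient, which unbiasedness says they should match in expectation.

```python
# Numerical sketch: the gradient of a single random sample is an unbiased
# estimator of the full empirical gradient (squared loss, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def grad_i(i):
    """Gradient of the per-sample squared loss l_i(w) = (x_i^T w - y_i)^2 / 2."""
    return (X[i] @ w - y[i]) * X[i]

# Full empirical gradient: the quantity SGD is estimating.
full_grad = np.mean([grad_i(i) for i in range(n)], axis=0)

# Average many single-sample gradient estimates; by unbiasedness this
# empirical mean approaches the full gradient as the number of draws grows.
draws = rng.integers(0, n, size=100_000)
avg_single = np.mean([grad_i(i) for i in draws], axis=0)

print(np.max(np.abs(full_grad - avg_single)))  # small: estimates agree
```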
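Algorithm 1 can be sketched in code as follows. This is a minimal illustration under assumptions not in the notes (squared loss, a noiseless synthetic linear-regression problem, and hypothetical names such as `batch_size` and `lr`):

```python
# Minimal sketch of Algorithm 1 (mini-batch SGD) for squared loss.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true  # noiseless targets, assumed for illustration

def minibatch_sgd(X, y, batch_size=32, epochs=50, lr=0.1):
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(n)              # randomly shuffle the data
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # g(w) = (1/k) * sum_{i in batch} grad_w l_i(w)
            g = (X[idx] @ w - y[idx]) @ X[idx] / len(idx)
            w = w - lr * g                     # w_{j+1} = w_j - eta_j g(w)
    return w

w_hat = minibatch_sgd(X, y)
print(np.round(w_hat, 3))  # close to w_true = [2, -1]
```

Shuffling once per epoch and then slicing consecutive batches, as above, is one common way to realize the "randomly shuffle data" step; sampling each batch independently at random is another.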
Correlated data are usually stored together.

### Bias of Mini-batch

Suppose $T(x_1, x_2, \dots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$. Then we have

$$\mathrm{Bias}(T) = \mathbb{E}[T] - \mathbb{E}[x] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[x_i] - \mathbb{E}[x] = \frac{1}{n}\sum_{i=1}^{n} \big( \mathbb{E}[x_i] - \mathbb{E}[x] \big) = 0,$$

so the mini-batch average is an unbiased estimator.

**Question:** What is the tradeoff when we use SGD and mini-batch SGD instead of gradient descent? Comparing SGD and mini-batch SGD with gradient descent, we have: Gradient Descent
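The preview cuts off before the comparison with gradient descent is completed. As a hedged numerical sketch of the statistical side of that tradeoff (the data, loss, and helper names below are my illustration, not from the notes), the experiment checks that the variance of the mini-batch gradient estimate shrinks roughly like $1/k$ as the batch size $k$ grows, while each step costs $k$ gradient evaluations instead of $n$:

```python
# Empirical check: variance of the mini-batch gradient estimate scales
# roughly like 1/k (squared loss, synthetic data, assumed for illustration).
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)

def batch_grad(idx):
    """Mini-batch gradient of the squared loss at w."""
    return (X[idx] @ w - y[idx]) @ X[idx] / len(idx)

def grad_variance(k, trials=2000):
    """Empirical variance (summed over coordinates) of the size-k estimate."""
    g = np.array([batch_grad(rng.integers(0, n, size=k)) for _ in range(trials)])
    return g.var(axis=0).sum()

v1, v64 = grad_variance(1), grad_variance(64)
print(v1 / v64)  # roughly 64: variance decreases like 1/k
```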
