# ILLINOIS CS 446 - 102617.2 (7 pages)


- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS 446: Machine Learning, Fall 2017

## Lecture 17: MLP Part II, Deep Feedforward Neural Networks

Lecturer: Sanmi Koyejo. Scribe: Shidi Zhao. Oct 26th, 2017.

Recap: MLP, loss functions, training tips.

### 1 Deep feedforward neural network

*Figure 1: computation graph of a simple deep feedforward neural network (figure not reproduced).*

We can write out the items in each layer:

$$z_{1,1} = g(w_{1,1}^T x) \tag{1}$$

$$z_{k,i} = g(w_{k,i}^T z_{k-1}) \tag{2}$$

where $g(\cdot)$ is some nonlinearity applied to the product, for example ReLU or sigmoid. Stacking the $m$ units of layer $k$ into a vector,

$$z_k = \begin{pmatrix} z_{k,1} \\ z_{k,2} \\ \vdots \\ z_{k,m} \end{pmatrix} = g\!\left( \begin{pmatrix} w_{k,1}^T z_{k-1} \\ w_{k,2}^T z_{k-1} \\ \vdots \\ w_{k,m}^T z_{k-1} \end{pmatrix} \right),$$

where $z_k$ denotes the arbitrary layer that we are looking at.

### 2 Loss functions

**Binary.** The output layer is $z_l = w_l^T z_{l-1}$ (bias terms are omitted here to keep the notation simple). The loss function is

$$L(y_i, f(x_i)) = -\left[ y_i \log f(x_i) + (1 - y_i) \log(1 - f(x_i)) \right] = \log\left( 1 + e^{-y\, w_l^T z_{l-1}} \right),$$

where the first form uses labels $y \in \{0, 1\}$ and the second form uses $y \in \{-1, +1\}$ with a sigmoid output. This loss function is also called log loss or binary cross entropy.

**Alternative binary.** The output layer is linear: $z_l = f(x) = w^T z_{l-1} + b_{l-1}$. The loss function is the hinge loss

$$L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)).$$

Both of these solve binary classification problems: the binary method uses log loss with a sigmoid output, while the alternative binary method applies hinge loss directly to a linear activation.

### 3 Multiclass classification

Here assume that we are trying to predict $k$ labels, so $y \in \{1, \dots, k\}$. The output layer maps the vector to probabilities:

$$z_{l,j} = \frac{e^{w_j^T z_{l-1}}}{\sum_{i=1}^{k} e^{w_i^T z_{l-1}}} = P(y_i = j), \qquad \sum_{j=1}^{k} z_{l,j} = 1.$$

The loss function is

$$L(y_i, f(x_i)) = -\sum_{j=1}^{k} y_{ij} \log f(x_i)_j.$$

This loss function is also called discrete cross entropy. It checks, for each possible configuration, whether $y_i$ is in that class, and scores the match by $\log f(x_i)_j$; thus $f(x_i)_j$ should be highest for the true $y$ and smaller for the other classes. Here $y_i$ is a one-hot encoding.

**Alternative classification.** This is similar to the alternative binary case; we just have more labels. The output layer is linear, $z_l = f(x) = w^T z_{l-1} + b_{l-1}$, and the loss function is the hinge loss $L(y_i, f(x_i)) = \max(0, 1 - y f(x_i))$.

**Regression.** $y \in \mathbb{R}^k$. The output layer $f(x_i)$ is linear, and the loss function is

$$L(y_i, f(x_i)) = \| y_i - f(x_i) \|_2^2.$$

**Multilabel.** $y \in \{0, 1\}^k$. The output layer $f(x)$ is a sigmoid, and the loss function sums a binary cross entropy over the labels:

$$L = \sum_{k} L_{\text{BCE}}(y_k, f(x_i)_k).$$

Here we treat each label as a binary classification and sum over all the different labels.

### Optimization

- Stochastic gradient descent (SGD).
- Mini-batch gradient descent. Advantages: (1) it reduces the variance compared to standard single-sample SGD; (2) it is faster than full-batch gradient descent.
- Adam: an extension of SGD that improves convergence.
- RMSprop: an extension of SGD.
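The layer-by-layer recursion $z_k = g(W_k z_{k-1})$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code; the two-layer setup, weight shapes, and random initialization are made up for the example, and biases are omitted as in the notes.

```python
import numpy as np

def relu(x):
    """ReLU nonlinearity g(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def forward(x, weights, g=relu):
    """Compute z_k = g(W_k z_{k-1}) layer by layer (biases omitted)."""
    z = x
    for W in weights:
        z = g(W @ z)
    return z

# Illustrative two-layer network: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
x = rng.standard_normal(3)
z = forward(x, weights)
```

With a ReLU nonlinearity every entry of the output is nonnegative, since the last operation applied is `g`.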
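The two forms of the binary loss above can be checked numerically, assuming a sigmoid output and keeping track of the two label conventions ($y \in \{0,1\}$ for the cross-entropy form, $y \in \{-1,+1\}$ for the logistic form). The helper names here are my own:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def bce(y, f):
    """Binary cross entropy -[y log f + (1-y) log(1-f)], y in {0, 1}."""
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

def log_loss(y_pm, s):
    """log(1 + exp(-y * s)) with y in {-1, +1}, s = w_l^T z_{l-1}."""
    return np.log1p(np.exp(-y_pm * s))

# The two forms agree when f = sigmoid(s) and y_pm = 2*y - 1.
s, y = 0.7, 1
lhs = bce(y, sigmoid(s))
rhs = log_loss(2 * y - 1, s)
```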
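The hinge loss used in the alternative (binary and multiclass) formulations is a one-liner; a small sketch with made-up scores, using the $\{-1,+1\}$ label convention:

```python
def hinge(y, s):
    """Hinge loss max(0, 1 - y*s) with y in {-1, +1} and score s = f(x)."""
    return max(0.0, 1.0 - y * s)

# Correct with margin: zero loss. Correct but inside the margin, or
# wrong side: positive loss growing linearly with the violation.
a = hinge(1, 2.0)    # confident and correct
b = hinge(1, 0.5)    # correct, inside the margin
c = hinge(-1, 0.5)   # wrong side of the boundary
```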
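The softmax output layer and the discrete cross entropy can be sketched as follows. The max-shift inside `softmax` is a standard numerical-stability trick, not something from the notes; the scores are illustrative.

```python
import numpy as np

def softmax(scores):
    """Map a score vector to probabilities z_{l,j}; entries sum to 1."""
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, p):
    """Discrete cross entropy: -sum_j y_j log p_j with one-hot y."""
    return -np.sum(y_onehot * np.log(p))

scores = np.array([2.0, 0.5, -1.0])
p = softmax(scores)
y = np.array([1.0, 0.0, 0.0])   # one-hot: true class is 0
loss = cross_entropy(y, p)       # equals -log p[0]
```

As the notes say, the loss is smallest when the probability assigned to the true class is highest.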
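Mini-batch gradient descent as described in the optimization list can be sketched for the squared regression loss on a linear model. The learning rate, batch size, epoch count, and noiseless synthetic data are illustrative assumptions, chosen so the iterates recover the true weights:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Mini-batch SGD on the squared loss ||y - Xw||^2 for a linear model.

    Each epoch shuffles the data, then steps on the averaged gradient of
    each batch: grad = 2 X_b^T (X_b w - y_b) / |b|.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# Noiseless synthetic regression problem: y = X w_true.
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = minibatch_sgd(X, y)
```

Averaging the gradient over a batch is what gives the lower variance (relative to single-sample SGD) mentioned in the notes, while still updating far more often than full-batch gradient descent.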
