# ILLINOIS CS 446 - 102617.2 (7 pages)

- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning

## Lecture 17: MLP Part II, Deep Feedforward Neural Network

CS446: Machine Learning, Fall 2017

Lecturer: Sanmi Koyejo. Scribe: Shidi Zhao. Oct. 26th, 2017.

### Recap

- MLP
- Loss function
- Training tips

### Deep feedforward neural network

Figure 1: computation graph of a simple deep feedforward neural network.

We can write out the items in each layer, where $z_{k,i}$ denotes unit $i$ of layer $k$:

$$z_{1,1} = g(w_{1,1}^T x)$$

$$z_{k,i} = g(w_{k,i}^T z_{k-1})$$

Here $g(\cdot)$ is some nonlinearity applied to the product, for example ReLU or sigmoid. In vector form, for a layer with $m$ units:

$$z_k = \begin{bmatrix} z_{k,1} \\ z_{k,2} \\ \vdots \\ z_{k,m} \end{bmatrix} = g\left( \begin{bmatrix} w_{k,1}^T \\ \vdots \\ w_{k,m}^T \end{bmatrix} z_{k-1} \right)$$

$z_k$ denotes the arbitrary layer that we are looking at.

### Loss function

**Binary.** The output layer is $z_l = w_l^T z_{l-1}$ (the bias is omitted here to keep the notation simple). The loss function is

$$L(y_i, f(x_i)) = -y_i \log f(x_i) - (1 - y_i) \log(1 - f(x_i)) = \log\left(1 + e^{-y_i w_l^T z_{l-1}}\right)$$

The first form uses labels $y_i \in \{0, 1\}$ with $f(x_i) = \sigma(z_l)$; the second form is the same loss written for labels $y_i \in \{-1, +1\}$. This loss function is also called log loss or binary cross entropy.

**Alternative binary.** The output layer is a linear function, $z_l = f(x) = w_l^T z_{l-1} + b_{l-1}$. The loss function is the hinge loss:

$$L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$$

Both cases above solve binary classification problems. The binary method applies log loss to a sigmoid output; the alternative binary method applies hinge loss directly to the linear activation.

**Multiclass classification.** Assume that we are trying to predict $k$ labels, so $y \in \{1, \dots, k\}$. The output layer maps the vector to probabilities:

$$z_{l,j} = \frac{e^{w_j^T z_{l-1}}}{\sum_{i=1}^{k} e^{w_i^T z_{l-1}}} = P(y_i = j), \qquad \sum_{j=1}^{k} z_{l,j} = 1$$

The loss function is

$$L(y_i, f(x_i)) = -\sum_{j=1}^{k} y_{ij} \log f(x_i)_j$$

This loss function is also called discrete cross entropy. For each possible class $j$ it checks whether $y_i$ belongs to that class and scores the match by $\log f(x_i)_j$. Thus $f(x_i)_j$ should be highest for the class corresponding to the true $y$ and smaller for the others. The vector $y_i$ with entries $y_{ij}$ is called a one-hot encoding.

**Alternative classification.** This is similar to the alternative binary case, except that there are more labels. The output layer is a linear function, $z_l = f(x) = w_l^T z_{l-1} + b_{l-1}$, and the loss function is the hinge loss:

$$L(y_i, f(x_i)) = \max(0, 1 - y f(x_i))$$

**Regression.** $y \in \mathbb{R}^k$. The output layer $f(x_i)$ is linear. The loss function is

$$L(y_i, f(x_i)) = \| y_i - f(x_i) \|_2^2$$

**Multilabel.** $y \in \{0, 1\}^k$. The output layer $f(x)$ is a sigmoid. The loss function is the sum of binary cross entropies over the labels:

$$L(y_i, f(x_i)) = \sum_{j=1}^{k} \text{binary cross entropy}\left(y_{ij}, f(x_i)_j\right)$$

Here we treat each label as a separate binary classification and sum over all labels.

### Optimization

- Stochastic gradient descent (SGD).
- Mini-batch gradient descent. Advantages: (1) it reduces the variance compared to standard SGD; (2) it is faster than batch gradient descent.
- Adam: an extension of SGD that improves convergence.
- RMSprop: an extension of SGD that improves convergence.

### Function graph

Figure 2: function graph.

Figure 3: local function graph for region a.

Figure 4: local function graph for region b.

### Regularization

1. L2 regularization (Gaussian prior). With $f(x_i)$ representing the whole neural network, the objective is

   $$\sum_{i=1}^{n} L(y_i, f(x_i)) + \sum_{l} \lambda_l \| w_l \|_2^2$$

2. L1 regularization (lasso).
3. $L_\infty$ (max-norm). Either constrain $\max(|w_l|) \le c_l$, or add a penalty: loss function $+ \sum_l \lambda_l \| w_l \|_\infty$.
4. Dropout.

**Dropout.** Dropout means randomly dropping some edges $w$. We take a layer $z_{l,m}$, sample a mask $U_{l,m} \sim \text{Bernoulli}(p)$, and compute $z_{l,m} \cdot U_{l,m}$. With this definition, each entry is kept with probability $p$ and dropped with probability $1 - p$. Using this format does not change the backward propagation of errors, but you need to apply the mask in both the forward and the backward steps.

Advantages: it is more robust to noise, and it reduces overfitting by increasing bias.

Problem: at prediction time we use the full network model and ignore the random dropout. During training, however, the expected value of a masked unit is

$$E[z \cdot U] = p z + (1 - p) \cdot 0 = p z$$

Thus the expected value is scaled by a factor of $p$ relative to the full network.

**Inverted dropout.** To solve this problem we scale $U_{l,m}$ first; this is called inverted dropout:

$$\tilde{z}_{l,m} = z_{l,m} \cdot \frac{U_{l,m}}{p}, \qquad E[\tilde{z}] = z$$

### Practical tips

1. Initialize $w$ as $0.01 \cdot N(0, 1)$. Question: why is initializing to 0 a bad idea?
   - It causes the nodes in the same layer to have the same value, and it takes a while to separate them later.
   - Depending on the choice of $g(x)$, the network may not be able to escape from 0, e.g. with ReLU.
2. Standardize your input $x$: mean$(x) = 0$ and var$(x) = 1$ per dimension.
3. Standardize layers: batch normalization.
4. Don't train a neural network from scratch. Instead, take a network that is already trained, chop off the last layer, treat the result as a feature vector of $x$, and train the final layer.
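The scaling argument from the dropout discussion can be checked numerically. The following is a minimal sketch in plain Python, not the lecture's own code: the function names (`dropout_mask`, `plain_dropout`, `inverted_dropout`, `mean_activation`), the example activations, and the choice $p = 0.8$ are all assumptions made for illustration. It treats $p$ as the probability of keeping a unit, which is what $U \sim \text{Bernoulli}(p)$ and $E[z] = pz$ imply.

```python
import random


def dropout_mask(n, p, rng):
    """Sample n independent Bernoulli(p) entries; 1 means the unit is kept."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def plain_dropout(z, mask):
    """Plain dropout: z * U. Its expected value is p * z."""
    return [zi * ui for zi, ui in zip(z, mask)]


def inverted_dropout(z, mask, p):
    """Inverted dropout: z * U / p. Its expected value is z."""
    return [zi * ui / p for zi, ui in zip(z, mask)]


def mean_activation(z, p, trials, inverted, seed=0):
    """Average the masked activations over many independently sampled masks."""
    rng = random.Random(seed)
    totals = [0.0] * len(z)
    for _ in range(trials):
        mask = dropout_mask(len(z), p, rng)
        if inverted:
            out = inverted_dropout(z, mask, p)
        else:
            out = plain_dropout(z, mask)
        totals = [t + o for t, o in zip(totals, out)]
    return [t / trials for t in totals]


z = [1.0, 2.0, -3.0]   # hypothetical layer activations
p = 0.8                # assumed keep probability

plain_avg = mean_activation(z, p, trials=20000, inverted=False)
inverted_avg = mean_activation(z, p, trials=20000, inverted=True)

print("plain dropout mean:   ", plain_avg)     # close to p * z
print("inverted dropout mean:", inverted_avg)  # close to z
```

With plain dropout the averaged activations approach $pz$, so a network evaluated without the mask would see larger activations than during training. With the $1/p$ rescaling the averages approach $z$, and the full network can be used at prediction time without further adjustment. In a real training loop the same mask would also multiply the gradient in the backward pass, which this sketch does not show.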
