# ILLINOIS CS 446 - 102617.1 (8 pages)

- Pages: 8
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning

**CS446: Machine Learning, Fall 2017**

**Lecture 17: MLP II**

Lecturer: Sanmi Koyejo. Scribe: Karthik Bala. Oct. 26th, 2017.

In this lecture, we finished discussing multilayer perceptrons, determined appropriate loss functions for different types of classification, and discussed dropout.

## Recap: Multilayer Perceptrons

Multilayer perceptrons, also known as deep feed-forward neural networks, can be drawn in the following way:

[Figure: diagram of a multilayer perceptron]

Each node in the hidden layer represents the nonlinear computation

$$z_{k,i} = g\left(w_{k,i}^T z_{k-1}\right),$$

where $g(x)$ is a nonlinear function. We can write the same computation for each layer as a matrix computation. Let $m$ be the number of nodes in the layer. Then layer $k$ of our feed-forward NN can be written as

$$z_k = \begin{bmatrix} z_{k,1} \\ z_{k,2} \\ \vdots \\ z_{k,m} \end{bmatrix} = g\left( \begin{bmatrix} w_{k,1}^T z_{k-1} \\ w_{k,2}^T z_{k-1} \\ \vdots \\ w_{k,m}^T z_{k-1} \end{bmatrix} \right).$$

Examples of nonlinear functions $g$ are the ReLU function and the sigmoid function.

## Output Layers and Loss Functions

Consider binary classification, $y \in \{-1, 1\}$. We can write $z_l$ in terms of all the nodes that enter it:

$$z_l = g\left(w_l^T z_{l-1} + \text{bias}\right) = f(x).$$

Given a choice of sigmoid for $g$, what is a reasonable loss function? With the labels recoded as $y_i \in \{0, 1\}$,

$$\ell(y_i, f(x_i)) = -y_i \log f(x_i) - (1 - y_i) \log\left(1 - f(x_i)\right),$$

which, with $y_i \in \{-1, 1\}$ and the bias absorbed into the weights, is equivalent to

$$\ell(y_i, f(x_i)) = \log\left(1 + e^{-y_i\, w_l^T z_{l-1}}\right).$$

This function is referred to as the log loss, binary cross entropy, or Bernoulli log likelihood.

Now consider the case in which $z_l$ is linear:

$$z_l = f(x) = w^T z_{l-1} + b_{l-1}.$$

An alternative loss function for binary classification is then the hinge loss:

$$\ell(y_i, f(x_i)) = \max\left(0,\ 1 - y_i f(x_i)\right).$$

### Multiclass Classification

Multiclass classification is defined by $y \in \{1, \dots, k\}$. Here we introduce the softmax function:

$$z_{l,j} = \frac{e^{w_j^T z_{l-1}}}{\sum_{c=1}^{k} e^{w_c^T z_{l-1}}} = \mathrm{Prob}(y_i = j) \quad \text{for each } j \in \{1, \dots, k\}.$$

Thus

$$\sum_{j=1}^{k} z_{l,j} = 1.$$

This is essentially a generalization of the sigmoid function to the case $k > 2$. Note that $w_l \in \mathbb{R}^{k \times (\text{size of } z_{l-1})}$, as we need a weight vector for each class. Then the loss function is

$$\ell(y_i, f(x_i)) = -\sum_{j=1}^{k} y_{ij} \log f(x_i)_j,$$

which is often called the discrete cross entropy. Note that here $y_i$ is stored in a one-hot encoding, so the loss is smallest when $f(x_i)_j$ is largest for the class $j$ with $y_{ij} = 1$.

As an alternative, if $f(x) = z_l$ (a linear activation), the loss function is the multiclass hinge loss.

### Regression

Regression is defined by $y \in \mathbb{R}^k$. Recall that here $f(x)$ is linear. Then the loss function is

$$\ell(y_i, f(x_i)) = \left\| y_i - f(x_i) \right\|_2^2,$$

and when $y \in \mathbb{R}$ this is simply $(y_i - f(x_i))^2$.

### Multilabel Classification

Multilabel classification is defined by $y \subseteq \{1, \dots, k\}$; that is, each example may carry any subset of the $k$ labels. The last layer output is a sigmoid function applied to each of the $k$ outputs, and the loss function $\ell(y, f)$ is the sum of the per-label binary cross entropies:

$$\ell(y, f) = \sum_{j=1}^{k} \text{binary cross entropy}(y_j, f_j).$$

## Optimization

Some optimization techniques we've discussed in class are stochastic gradient descent (SGD) and minibatch SGD, which has the additional benefit of reducing the variance of the gradient estimate compared to single-example SGD.

Some additional optimization techniques include Adam and RMSProp, both of which are extensions of SGD. Sanmi mentioned that both of these methods leverage the history of the gradient and require additional memory.

## Function Graph

Consider the function graph.

[Figure: function graph for node $a$, with forward-pass inputs and outputs in blue and backward-pass inputs and outputs in red]

Consider a node $a$, which takes the results of all the activation functions from the last layer, multiplies them by the weight vector, sums them, and then passes its result to the next activation function. The computation done by node $a$ is written in the center. The inputs and outputs for the forward pass are drawn in blue, and the inputs and outputs for the backward pass are drawn in red.

[Figure: function graph for node $b$]

Now consider a node $b$, which represents the computation of an activation function on the output of node $a$ and passes its output to all the nodes in the next layer.

## Regularization

We discuss a few methods for regularization.

1. Consider L2 regularization, or Gaussian prior regularization. The objective is the loss summed over the $n$ training examples plus a penalty summed over the layers $l$:

   $$\sum_{i=1}^{n} \ell(y_i, f(x_i)) + \sum_{l} \lambda_l \left\| w_l \right\|_2^2.$$

   The second term regularizes the weights, as in logistic or linear regression, and can be added as another node in backpropagation.

2. L1 regularization was also mentioned but not discussed.

3. The max norm ($L_\infty$ norm) is defined as $\|w\|_\infty = \max_d |w_d|$. It can act as either a constraint or a regularizer; that is, either we require $\|w_l\|_\infty \le c_l$, or, given some loss function $\ell$, we minimize

   $$\ell + \sum_{l} \lambda_l \left\| w_l \right\|_\infty.$$

## Dropout

### Hadamard Product

The Hadamard product (entrywise product) of two matrices $A$ and $B$ can be written as $A \odot B$, where $(A \odot B)_{ij} = A_{ij} B_{ij}$.

### Dropout

The dropout technique is to randomly keep each edge with probability $p$ (dropping it with probability $1 - p$), and can be written as

$$z_{m,l} \odot u_{m,l},$$

where $u_{m,l}$ is modeled as a Bernoulli random variable with probability $p$, so that each edge is dropped independently and with the same probability.

Conceptually, this is similar to random forests, with the difference that random forests randomize choices over data and sometimes features. Both techniques randomize the structure of the model.

Dropout is easily combined with backpropagation. Additionally, it is more robust to noise and reduces overfitting. However, it also increases the bias of the network.

In practice, we use the full network, ignoring random dropout, for prediction. However, in training, the expected value of a node $z$ is

$$E[z] = p \cdot z + (1 - p) \cdot 0 = pz,$$

which creates the problem that each node is scaled down by $p$ during training but not during prediction. To resolve this mismatch, we must account for the factor $p$ in our final prediction, for example by scaling each node's output by $p$ so that it matches the training-time expectation $E[z] = pz$. A more efficient way of achieving the same result is called inverted dropout.

### Inverted Dropout

Here we divide each node by $p$ beforehand, during training, so that each node's expected value remains the same. That is, we take

$$\frac{u_{m,l} \odot z_{m,l}}{p},$$

and thus the expected value is $z$ for each node, and no correction is needed at prediction time.

## Practical Tips

1. Scale the weights down by initializing them to $0.01 \cdot N(0, 1)$. Initializing the weights to zero is a bad idea, as nodes in the same layer then have the same values, which slows down training by backpropagation. Furthermore, initializing the bias and weights to 0 and choosing ReLU for the activation function could result in an output of zero, which is undesirable.
2. It is often useful to standardize the input; that is, each dimension of $x$ should have zero mean and unit variance.
3. Standardize the layers. This is called batch normalization.
4. Instead of training the network from scratch, use an already trained network, chop off the last layer, and train a new final layer. Conceptually, this can be thought of as using the pre-trained network as a feature extractor.
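
The difference between dropout and inverted dropout described in the notes can be checked numerically. The following is a minimal sketch using only the Python standard library, not the lecture's own code. It assumes that $p$ is the probability of keeping a node, which is the reading consistent with the expectation $E[z] = pz$ in the notes. The function names (`dropout_mask`, `standard_dropout`, `inverted_dropout`, `empirical_mean`) are hypothetical and chosen only for illustration.

```python
import random


def dropout_mask(n, keep_prob, rng):
    """Sample n independent Bernoulli(keep_prob) values: 1.0 keeps a node, 0.0 drops it."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n)]


def standard_dropout(z, keep_prob, rng):
    """Training-time dropout: entrywise (Hadamard) product of z with a random mask."""
    u = dropout_mask(len(z), keep_prob, rng)
    return [zi * ui for zi, ui in zip(z, u)]


def inverted_dropout(z, keep_prob, rng):
    """Inverted dropout: mask, then divide by keep_prob so the expected output equals z."""
    u = dropout_mask(len(z), keep_prob, rng)
    return [zi * ui / keep_prob for zi, ui in zip(z, u)]


def empirical_mean(fn, z, keep_prob, trials, seed=0):
    """Average the output of a dropout function over many independent random masks."""
    rng = random.Random(seed)
    totals = [0.0] * len(z)
    for _ in range(trials):
        out = fn(z, keep_prob, rng)
        totals = [t + o for t, o in zip(totals, out)]
    return [t / trials for t in totals]


if __name__ == "__main__":
    z = [1.0, 2.0, -3.0]   # assumed example activations
    p = 0.8                # assumed keep probability
    # Averages approach p * z: activations are scaled down by p in expectation.
    print(empirical_mean(standard_dropout, z, p, 20000))
    # Averages approach z itself: no correction is needed at prediction time.
    print(empirical_mean(inverted_dropout, z, p, 20000))
```

With standard dropout the averaged activations come out near $pz$, so the full network used for prediction must be rescaled by $p$. With inverted dropout the averages come out near $z$, so the full network can be used for prediction unchanged.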
