# CS 446 Machine Learning: Lecture 17 Notes (102417.1), University of Illinois Urbana-Champaign


CS 446 Machine Learning, Fall 2017

**Lecture 17: Neural Networks with Non-Linear Activations**

Lecturer: Sanmi Koyejo. Scribe: Justin Szaday (szaday2). October 24th, 2017.

## Announcements

- Please complete the midterm survey if you have not already.
- The details of the project will be out by the end of the week; half a class will be dedicated to discussing it next week. The project will need a lot of compute hours, and to help, Microsoft has donated some to the class. If you would like to learn more about their platform, Azure, they will be hosting a tutorial on it Wednesday, November 1st. There will be a Piazza post with details soon.

## The Basic Perceptron

*Figure 1: A perceptron in graph form, with linear activations.*

The goal of the perceptron is to estimate a function $f : \mathbb{R}^d \to \{-1, 1\}$ that makes good predictions given a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. To do this, we aim to minimize the loss

$$\ell(D_n) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)),$$

where, for the perceptron in particular, the function $f$ is a simple linear function of the form $f(x) = w^T x + b$, and the loss is given by

$$\ell(y_i, f(x_i)) = \max(0, -y_i f(x_i)).$$

This loss function linearly penalizes mistakes and is zero otherwise. Figure 1 gives a graphical representation of the perceptron; in the next section we extend this concept to more complex functions.

## The Multi-Layer Perceptron

*Figure 2: A multi-layer perceptron in graph form, with linear activations.*

The idea of the multi-layer perceptron stems from repeating the basic perceptron more than once: we repeat the basic perceptron multiple times, forming a layer, then feed the results into another layer, then feed those results into yet another layer, and so on. Figure 2 shows a simple two-layer example. The first layer is formed of $k$ functions, each equivalent to the standard perceptron with its own weights and bias. The $k \times 1$ vector they form, called $z$, is fed into the output layer, which produces the $h \times 1$ output vector that gets fed into the loss function during training. This can be thought of as a 2-layer neural network with one hidden layer and linear activations (the meaning of "activations" is explained in a later section). Hidden layers are the layers leading up to the output layer; their results form the vector $z$, given by

$$z(x_i) = \begin{bmatrix} w_{1,1}^T x_i + b_{1,1} \\ w_{1,2}^T x_i + b_{1,2} \\ \vdots \\ w_{1,k}^T x_i + b_{1,k} \end{bmatrix} = W_1 x_i + b_1. \tag{1}$$

Likewise, we can express $f$ as

$$f(x_i) = W_2 z + b_2 = W_2 (W_1 x_i + b_1) + b_2. \tag{2}$$

This shows that the multi-layer perceptron is equivalent to a simple linear model, like linear regression. As such, we would like to extend it to include non-linearity in the output, so that we can fit more complex functions.

## Neural Networks with Non-Linear Activations

*Figure 3: A two-layer neural network with activations.*

So far, the models we have discussed in this course fall into two categories: linear models, and linear models with non-linear feature transformations (equivalent to kernels). Neural networks with non-linear activations represent another category: universal approximators. This means that, with enough complexity, such models are able to perfectly fit any training set, even complex, non-linear ones, with no training error.

So what are activations? Figure 3 shows the basic idea: we add an activation function $\sigma$ between the first and second layers. In the case of the multi-layer perceptron this activation function was implicitly linear, given by $\sigma(x) = x$; however, we are not limited to such simple choices for $\sigma$. To add non-linearity to our model, some common choices of activation function are the sigmoid, the hyperbolic tangent (written $\tanh$), and the Rectified Linear Unit (ReLU), given by $\mathrm{ReLU}(x) = \max(0, x)$.

This simple addition to the model is extremely powerful, and is what enables it to be a universal approximator. For example, it can be shown that, for a two-layer neural network with sigmoid activations, there exist $W_1, W_2$ such that $|f - g| \le \epsilon$ for any function $g : \mathbb{R}^d \to \mathbb{R}$ and a fixed $\epsilon$, where $f$ represents the neural network. While this might be a bit hard to parse, it basically says that a non-linear neural network can be fit to any function with arbitrary precision, and therefore, taking $\epsilon \to 0$, it is a universal approximator.

However, being a universal approximator is not particularly impressive in and of itself, since other such approximators exist. For example, a sum of RBF kernels can be a universal approximator, since it can effectively model a function point-wise. The drawback to such a model is that its size is usually huge, since it depends on every single data point of the training data. This is where neural networks, particularly deep neural networks (those with many hidden layers), shine. A recent theorem showed that, for certain functions, a shallow neural network (one with few hidden layers) must be exponentially large, whereas an equivalent deep network is small. This means that, comparatively, deep neural networks are efficient models.

## Optimizing Neural Networks

When it comes to optimizing neural networks, we can think of them as a variant of feature selection. This makes optimizing them similar to other models like boosting, decision trees, and bagging (although in the case of bagging we simply average the weights instead of optimizing them). We can consider $z$ as a function of $x_i$, where $z$ once again represents the output of the last hidden layer, which effectively selects and weighs the features. From this perspective, the optimization problem becomes

$$\min_{W, b} \sum_{i=1}^n \ell(y_i, W^T z(x_i) + b), \tag{3}$$

where $W$ and $b$ are the weights and biases of the last layer, respectively.

### Finding the Weights

To solve the optimization problem in equation (3), we almost always use a variant of Stochastic Gradient Descent (SGD). To compute the gradients, we use back-propagation, which is particularly useful for computing the gradients of complex, nested functions. This is partly because back-propagation can avoid the unnecessary re-computations that stem from many nodes in a neural network feeding through the same nodes. This exposes a trade-off between speed and memory, since an implementation can favor speed by avoiding re-computations, but may have to store a lot of intermediate data to do so. Variants of SGD like RMSProp, AdaGrad, and Adam expose other ways to improve performance by adding approximations of higher-order information (such as using historical gradients), which improves convergence speed. For this class, however, we will be focusing on the standard form of back-propagation, an example of
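The collapse of a linear-activation multi-layer perceptron into a single linear model, as in equation (2), is easy to check numerically. The sketch below (dimensions, seed, and variable names are illustrative assumptions, not from the notes) verifies the collapse and shows how inserting a ReLU breaks it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small dimensions, chosen only for illustration.
d, k, h = 4, 3, 2
W1, b1 = rng.normal(size=(k, d)), rng.normal(size=k)
W2, b2 = rng.normal(size=(h, k)), rng.normal(size=h)

x = rng.normal(size=d)

# Two-layer perceptron with linear (identity) activations.
z = W1 @ x + b1
f_linear = W2 @ z + b2

# Collapsed single linear model: f(x) = (W2 W1) x + (W2 b1 + b2).
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(f_linear, W @ x + b)

# With a non-linear activation (e.g. ReLU) this collapse no longer holds.
def relu(v):
    return np.maximum(0.0, v)

f_relu = W2 @ relu(W1 @ x + b1) + b2
```

The assertion passing for any choice of weights is exactly the statement that a linear-activation network is just linear regression in disguise.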

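The SGD-plus-back-propagation procedure sketched in the last section can be made concrete for a tiny two-layer network with sigmoid hidden activations and squared loss. Everything below (the toy $\sin$ regression target, the hidden width, learning rate, and step count) is an illustrative assumption, not from the notes; the backward pass is just the chain rule applied to cached forward-pass values:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical toy regression problem: fit g(x) = sin(x) on [-2, 2].
X = rng.uniform(-2.0, 2.0, size=(200, 1))
Y = np.sin(X)

d, k = 1, 16                       # input dimension, hidden width
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
W2, b2 = rng.normal(size=(1, k)) * 0.1, np.zeros(1)

lr = 0.1
for step in range(5000):
    i = rng.integers(len(X))       # SGD: one random example per step
    x, y = X[i], Y[i]

    # Forward pass, caching intermediates for back-propagation.
    a = W1 @ x + b1                # hidden pre-activation
    z = sigmoid(a)                 # hidden layer output
    f = W2 @ z + b2                # network output
    err = f - y                    # d(0.5 * (f - y)^2) / df

    # Backward pass (chain rule), reusing the cached values.
    dW2 = np.outer(err, z)
    db2 = err
    dz = W2.T @ err
    da = dz * z * (1.0 - z)        # sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
    dW1 = np.outer(da, x)
    db1 = da

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Mean squared error after training; it typically falls well below the
# variance of the target (about 0.59 for sin on this interval).
mse = np.mean((sigmoid(X @ W1.T + b1) @ W2.T + b2 - Y) ** 2)
```

Note how each cached quantity (`a`, `z`, `f`) is reused in the backward pass rather than recomputed; this is the speed-for-memory trade-off the notes describe.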