# ILLINOIS CS 446 - 110217.1 (7 pages)


- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS446 Machine Learning, Fall 2017. Lecture 20: CNN, RNN, and LSTM. Lecturer: Sanmi Koyejo. Scribe: Wei Qian. Nov 2nd, 2017.

## Announcements

- Exam 2 is next Thursday (11/09/17), with a short review/recap next Tuesday (11/07/17). The exam is not explicitly cumulative and will cover lectures from 10/3/17 to 10/31/17.
- The course final project has been released and is due on 12/19/17.

## Agenda for Today

- Recap: CNN
- Continue: pooling layer, output layer, demo
- Introduction to Recurrent Neural Networks (RNN)

## Agenda for Next Tuesday

- RNN continued, with LSTM
- Start unsupervised learning
- Short Exam 2 review

## CNN Operation and Output Tensor

If we convolve a $W_1 \times H_1 \times D_1$ input tensor with $K$ filters of size $F \times F \times D_1$, using padding $P$ and stride $S$ (Figure 1: Convolution Operation), the resulting output tensor has shape $W_2 \times H_2 \times D_2$, where

$$W_2 = \frac{W_1 - F + 2P}{S} + 1, \qquad H_2 = \frac{H_1 - F + 2P}{S} + 1, \qquad D_2 = K.$$

### Model Hyperparameters

As we can see, $K$, $F$, $P$, $S$, and sometimes the filter depth $D$ are all hyperparameters:

- For small networks, we do a hyperparameter search using cross-validation or Bayesian optimization.
- For large networks that can take days or weeks to train, we start from parameters published by others and tweak them for our own task.

## CNN Pooling and Output Layers

Recall that besides the convolution layers described above, a CNN also has pooling and output layers (Figure 2: Convolutional Neural Network Architecture).

### Pooling Layer

The main idea of the pooling layer is to capture location and scale invariance in the input data:

- Max pooling: the max of the values in the block.
- Average pooling: the average of the values in the block.

(Figure 3: Example of Max Pooling)

Similar to convolution, we can control the following parameters of the pooling layer:

- Pooling function (max, average, etc.)
- Filter size $F$
- Stride $S$, which is usually equal to the filter size $F$
- Padding $P$

If we consider the 1D pattern-matching example from the last lecture, one prediction function is to take the max of the final output vector $z$; that gives us 1 regardless of where the pattern shows up, helping us capture the location invariance of the recognition task.

### Output Layer

In the output layer we want to turn our tensor into a prediction representation. One such layer is the fully connected layer (Figure 4: Fully Connected Output Layer), which is similar to the perceptron introduced in previous lectures; we can also have multiple hidden fully connected layers before the final output.

### Practical Tips for CNNs in Computer Vision

To train networks that are invariant to rotation, we can use data augmentation, producing new samples by rotating the original data. A similar trick handles other kinds of invariance; for size invariance, we can randomly rescale the original training data and take patches.

## Demo

- ConvNetJS
- Effective RNN (for a later section)
- Char-RNN code (for a later section)

## Recurrent Neural Networks (RNN)

As an example, consider a character-level sequence prediction model running an RNN (Figure 5: Character-Level Prediction RNN). For the input layer here we used a one-hot representation of our alphabet, since its size is relatively small; for word-level prediction, however, we might want to use a symbol/word embedding (i.e., a word-vector representation) instead of a one-hot vector.

We can formally define such a network as

$$y^{(t)} = g\left(V^{(t)\top} z^{(t)} + a^{(t)}\right)$$
$$z^{(t)} = g\left(U^{(t)\top} x^{(t)} + W^{(t)\top} z^{(t-1)} + b^{(t)}\right)$$

where $g$ is the activation function, $V^{(t)}$, $U^{(t)}$, and $W^{(t)}$ are weight matrices, and $a^{(t)}$ and $b^{(t)}$ are the bias terms. To ensure the model can apply to sequences of varying length, we impose a weight-sharing rule across time $t$:

$$V^{(t)} = V, \quad U^{(t)} = U, \quad W^{(t)} = W, \quad a^{(t)} = a, \quad b^{(t)} = b \quad \text{for all } t.$$

### Computational Graph

The RNN has the computational graph shown in Figure 6 (RNN Cell Computation), where we ignore the bias terms for simplicity.

### Backpropagation

If we look at Figure 5 and consider the loss for output $y^{(3)}$, we have the following gradient with respect to $W$:

$$\frac{\partial\, \mathrm{cost}(y^{(3)})}{\partial W} = \frac{\partial\, \mathrm{cost}(y^{(3)})}{\partial y^{(3)}} \cdot \frac{\partial y^{(3)}}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial a^{(3)}} \cdot \frac{\partial a^{(3)}}{\partial W}$$

where $a^{(3)} = U^\top x^{(3)} + W^\top z^{(2)} + b$. As we can see, in $a^{(3)}$ both $W$ and $z^{(t-1)}$ depend on $W$, so expanding the recurrence shows that the gradient of the loss $l_T$ at time $T$ contains a product of $T$ factors of the form $g'(a^{(t)})\, W^\top$:

$$\frac{\partial l_T}{\partial W} \propto \prod_{t=1}^{T} g'\left(a^{(t)}\right) W^\top.$$

Since $T$ can be very large, we see that as $T$ grows:

- if $\|W\| < 1$, then $\frac{\partial l_T}{\partial W} \to 0$, which is called the **vanishing gradient** problem;
- if $\|W\| > 1$, then $\frac{\partial l_T}{\partial W} \to \infty$, which is called the **exploding gradient** problem.
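To make the convolution output-shape formula concrete, here is a minimal sketch; the function name `conv_output_shape` is my own, not from the lecture.

```python
def conv_output_shape(W1, H1, D1, F, K, P, S):
    """Shape of the output tensor when convolving a W1 x H1 x D1 input
    with K filters of size F x F x D1, padding P, and stride S."""
    # Both spatial formulas must divide evenly for a valid configuration.
    assert (W1 - F + 2 * P) % S == 0 and (H1 - F + 2 * P) % S == 0
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K  # one output channel per filter
    return (W2, H2, D2)

# Example: a 32x32x3 image with ten 5x5 filters, padding 2, stride 1
# preserves the spatial size: (32, 32, 10).
print(conv_output_shape(32, 32, 3, F=5, K=10, P=2, S=1))
```

Note that with $P = (F - 1)/2$ and $S = 1$ ("same" padding), the spatial dimensions are unchanged.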
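A small sketch of the max-pooling operation described above, for the typical case where the stride equals the filter size; the helper `max_pool` is hypothetical, not from the lecture.

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """Max pooling over a 2D array with F x F blocks and stride S
    (here S = F, so the blocks do not overlap)."""
    H, W = x.shape
    out = np.empty((H // S, W // S))
    for i in range(0, H - F + 1, S):
        for j in range(0, W - F + 1, S):
            out[i // S, j // S] = x[i:i + F, j:j + F].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool(x))  # max of each 2x2 block
```

Average pooling would replace `.max()` with `.mean()`; everything else stays the same.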
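The RNN recurrence with weight sharing can be sketched directly from the two defining equations; the toy dimensions and random initialization below are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 4, 8, 3, 5  # toy sizes

# Shared parameters: the same U, W, V, a, b are used at every time step t.
U = rng.normal(size=(d_in, d_hid)) * 0.1   # input-to-hidden
W = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden-to-hidden
V = rng.normal(size=(d_hid, d_out)) * 0.1  # hidden-to-output
a = np.zeros(d_out)
b = np.zeros(d_hid)
g = np.tanh  # activation function

xs = [rng.normal(size=d_in) for _ in range(T)]  # input sequence x_1..x_T
z = np.zeros(d_hid)                             # initial hidden state z_0
ys = []
for x in xs:
    z = g(U.T @ x + W.T @ z + b)  # z_t = g(U^T x_t + W^T z_{t-1} + b)
    ys.append(g(V.T @ z + a))     # y_t = g(V^T z_t + a)

print(len(ys), ys[0].shape)  # T outputs, each of dimension d_out
```

Because `U`, `W`, `V`, `a`, `b` live outside the loop, the same sketch runs unchanged for any sequence length `T`, which is exactly what the weight-sharing rule buys us.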
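The vanishing/exploding gradient argument can be seen numerically in the scalar case: with a linear activation ($g' = 1$), the product $\prod_{t=1}^{T} g'(a^{(t)})\, W^\top$ reduces to $W^T$, so the size of the recurrent weight alone decides the fate of the gradient. This is a deliberately simplified sketch, not the full matrix analysis.

```python
# Scalar stand-in for the product of T gradient factors g'(a_t) * W^T.
# With g' = 1, the product is just w ** T.
T = 100
for w in (0.9, 1.0, 1.1):
    grad_factor = w ** T
    print(f"|W| = {w}: product over {T} steps = {grad_factor:.3e}")
```

For $T = 100$, $0.9^{100} \approx 2.7 \times 10^{-5}$ (vanishing) while $1.1^{100} \approx 1.4 \times 10^{4}$ (exploding), even though both weights are close to 1.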
