CS446: Machine Learning, Fall 2017
Lecture 20: CNN, RNN and LSTM
Lecturer: Sanmi Koyejo        Scribe: Wei Qian, Nov. 02, 2017

Announcements

• Exam #2 is next Thursday (11/09/17)
  – Short review/recap next Tuesday (11/07/17)
  – Not explicitly cumulative; it will cover lectures from 10/03/17 to 10/31/17
• The course final project has been released and is due on 12/19/17

Agenda for Today

• Recap
• CNN, continued
  – Pooling layer
  – Output layer
  – Demo
• Introduction to Recurrent Neural Networks (RNN)

Agenda for Next Tuesday

• RNN, continued with LSTM
• Start unsupervised learning
• Short Exam #2 review

CNN Operation and Output Tensor

If we convolve a W_1 x H_1 x D_1 input tensor with K filters of size F x F x D_1, padding P, and stride S:

Figure 1: Convolution Operation

the resulting output tensor has shape W_2 x H_2 x D_2, where

W_2 = \frac{W_1 - F + 2P}{S} + 1, \quad H_2 = \frac{H_1 - F + 2P}{S} + 1, \quad D_2 = K

Model Hyper-parameters

As we can see, K, F, P, S, and sometimes the filter depth D are all hyper-parameters, so:
• for a small network, we do a hyper-parameter search using
  – cross validation
  – Bayesian optimization
• for a large network that can take days or weeks to train, we typically
  – start from others' published parameters
  – tweak them based on our own task

CNN Pooling and Output Layers

Recall that in a CNN we also have pooling and output layers besides the convolution layers described above:

Figure 2: Convolutional Neural Network Architecture

Pooling Layer

The main idea of the pooling layer is to capture location and scaling invariance in the input data.
• Max pooling: take the max of the values in each block
• Average pooling: take the average of the values in each block

Figure 3: Example of Max Pooling

Similar to convolution, we can control the following parameters of the pooling layer:
• Pooling function: max, average, etc.
• Filter size F
• Stride S (usually the same as the filter size F)
• Padding P

If we consider the 1D pattern-matching example from last lecture, one prediction function is to take the max of the final output vector Z. That gives us 1 regardless of where the pattern shows up, which captures the location invariance of the recognition task.

Output Layer

In the output layer we want to turn our tensor into a prediction representation; one such layer is the fully connected layer:

Figure 4: Fully Connected Output Layer

This is similar to the perceptron introduced in previous lectures, and we can have multiple hidden/fully connected layers before the final output.
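To make the output-shape formula and the pooling operation above concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture; the helper names conv_output_shape and max_pool2d are made up, and the pooling loop assumes a single 2D feature map):

    import numpy as np

    def conv_output_shape(W1, H1, F, P, S, K):
        """Output shape (W2, H2, D2) of a conv layer, per W2 = (W1 - F + 2P)/S + 1."""
        W2 = (W1 - F + 2 * P) // S + 1
        H2 = (H1 - F + 2 * P) // S + 1
        return W2, H2, K

    def max_pool2d(x, F=2, S=2):
        """Max pooling with an F x F window and stride S over a 2D array."""
        H, W = x.shape
        H_out, W_out = (H - F) // S + 1, (W - F) // S + 1
        out = np.empty((H_out, W_out))
        for i in range(H_out):
            for j in range(W_out):
                out[i, j] = x[i * S:i * S + F, j * S:j * S + F].max()
        return out

    # 32x32x3 input, K = 10 filters of size 5x5, padding 2, stride 1  ->  (32, 32, 10)
    print(conv_output_shape(W1=32, H1=32, F=5, P=2, S=1, K=10))

    # 4x4 feature map, 2x2 max pooling with stride 2  ->  2x2 output
    print(max_pool2d(np.arange(16.0).reshape(4, 4)))

Average pooling would simply replace .max() with .mean() in the inner loop.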
Practical Tips for CNNs (in Computer Vision)

• To train networks that are invariant to rotation, we can do data augmentation, producing new samples by rotating the original data. Similar tricks handle other kinds of invariance.
• For size invariance, we can randomly rescale patches of the original training data.

Demo

• ConvNetJS
• Effective RNN (for the later section)
• Char-RNN code (for the later section)

Recurrent Neural Networks (RNN)

An example of a character-level sequence prediction model using an RNN:

Figure 5: Character-Level Prediction RNN

For the input layer here we used a one-hot representation of our alphabet (since its size is relatively small); for word-level prediction, however, we might want to use a symbol/word embedding (i.e. word vector) representation instead of a one-hot vector.

We can formally define such a network as

y^{(t)} = g\big(V^{(t)\top} z^{(t)} + a^{(t)}\big)
z^{(t)} = g\big(U^{(t)\top} x^{(t)} + W^{(t)\top} z^{(t-1)} + b^{(t)}\big)

where g is the activation function, V^{(t)}, U^{(t)}, and W^{(t)} are weight matrices, and a^{(t)} and b^{(t)} are bias terms.

To ensure that such a model applies to sequences of varying length, we impose a weight-sharing rule across time t:

V^{(t)} = V, \quad U^{(t)} = U, \quad W^{(t)} = W, \quad a^{(t)} = a, \quad b^{(t)} = b \qquad \forall t

Computational Graph

Ignoring the bias terms for simplicity, the RNN has the following computational graph:

Figure 6: RNN Cell Computation

Backpropagation

If we look at Figure 5 and consider the loss on the output y^{(3)}, the gradient with respect to W is

\frac{\partial\,\mathrm{cost}(y^{(3)})}{\partial W}
  = \frac{\partial\,\mathrm{cost}(y^{(3)})}{\partial y^{(3)}}
    \cdot \frac{\partial y^{(3)}}{\partial z^{(3)}}
    \cdot \frac{\partial z^{(3)}}{\partial g^{(3)}}
    \cdot \frac{\partial g^{(3)}}{\partial a^{(3)}}
    \cdot \frac{\partial a^{(3)}}{\partial W}

where

a^{(3)} = U^{\top} x^{(3)} + W^{\top} z^{(2)} + b

As we can see in a^{(3)}, W appears directly and z^{(2)} itself depends on W through the earlier time steps. Unrolling this dependence over time, we can show that

\frac{\partial \ell^{(T)}}{\partial W} \propto |W|^{T} \left(\frac{\partial g^{(t)}}{\partial a^{(t)}}\right)^{T}

Since T can be very large, as T grows:
• if |W| < 1, then \frac{\partial \ell^{(T)}}{\partial W} \to 0, which is called the vanishing gradient problem;
• if |W| > 1, then \frac{\partial \ell^{(T)}}{\partial W} \to \infty, which is called the exploding gradient problem.
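To make the forward recurrence above concrete, here is a minimal NumPy sketch of an RNN forward pass with the shared weights U, W, V, a, b reused at every time step (my own sketch, not code from the lecture; it assumes g = tanh and a zero initial state z^{(0)} = 0):

    import numpy as np

    def rnn_forward(xs, U, W, V, a, b, g=np.tanh):
        """Run the RNN forward over a sequence xs = [x_1, ..., x_T].

        The same U, W, V, a, b are reused at every step (weight sharing),
        which is what lets the model handle sequences of any length.
        """
        z = np.zeros(W.shape[0])          # z^{(0)}: initial hidden state
        ys, zs = [], []
        for x in xs:
            z = g(U.T @ x + W.T @ z + b)  # z^{(t)} = g(U^T x^{(t)} + W^T z^{(t-1)} + b)
            y = g(V.T @ z + a)            # y^{(t)} = g(V^T z^{(t)} + a)
            zs.append(z)
            ys.append(y)
        return ys, zs

    # Tiny example: one-hot inputs over a 4-character alphabet, hidden size 8.
    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 4, 8, 4
    U = rng.normal(scale=0.1, size=(d_in, d_hid))
    W = rng.normal(scale=0.1, size=(d_hid, d_hid))
    V = rng.normal(scale=0.1, size=(d_hid, d_out))
    a, b = np.zeros(d_out), np.zeros(d_hid)

    xs = [np.eye(d_in)[c] for c in [0, 2, 1, 3]]   # a length-4 one-hot sequence
    ys, zs = rnn_forward(xs, U, W, V, a, b)
    print(len(ys), ys[0].shape)                    # 4 outputs, each of size 4

In the character-level model of Figure 5 the output activation would usually be a softmax over the alphabet rather than g itself; g is kept generic here to match the equations above.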


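Building on that sketch, here is a tiny scalar illustration (again my own, under the simplifying assumption of a linear activation) of the |W|^T factor in the gradient: backpropagating through T time steps multiplies the gradient by the recurrent weight w at every step.

    # With a scalar recurrent weight w and a linear activation, each step of
    # backpropagation through time multiplies the gradient by w, so after
    # T steps the gradient is scaled by w**T.
    for w in (0.9, 1.0, 1.1):
        grad = 1.0
        for _ in range(100):          # T = 100 time steps
            grad *= w
        print(f"w = {w}: gradient factor after 100 steps = {grad:.3e}")
    # w = 0.9 -> ~2.7e-05 (vanishing); w = 1.1 -> ~1.4e+04 (exploding)

In practice, exploding gradients are commonly handled by gradient clipping, while vanishing gradients motivate gated architectures such as the LSTM discussed next lecture.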
