CS446: Machine Learning, Fall 2017
Lecture 20: CNN, RNN and LSTM
Lecturer: Sanmi Koyejo        Scribe: Wei Qian, Nov. 02, 2017

Announcements

• Exam #2 is next Thursday (11/09/17)
  – Short review/recap next Tuesday (11/07/17)
  – Not explicitly cumulative; it will cover lectures from 10/03/17 to 10/31/17
• The course final project has been released and is due on 12/19/17

Agenda for Today

• Recap
• CNN, continued
  – Pooling layer
  – Output layer
  – Demo
• Introduction to Recurrent Neural Networks (RNN)

Agenda for Next Tuesday

• RNN, continued with LSTM
• Start unsupervised learning
• Short Exam #2 review

CNN Operation and Output Tensor

If we convolve a W_1 x H_1 x D_1 input tensor with K filters of size F x F x D_1, padding P, and stride S:

Figure 1: Convolution Operation

the resulting output tensor has shape W_2 x H_2 x D_2, where

W_2 = \frac{W_1 - F + 2P}{S} + 1, \quad H_2 = \frac{H_1 - F + 2P}{S} + 1, \quad D_2 = K

Model Hyper-parameters

As we can see, K, F, P, S, and sometimes the filter depth D are all hyper-parameters, so:
• for a small network, we do a hyper-parameter search using
  – cross validation
  – Bayesian optimization
• for a large network that can take days or weeks to train, we typically
  – start from others' published parameters
  – tweak them based on our own task

CNN Pooling and Output Layers

Recall that in a CNN we also have pooling and output layers besides the convolution layers described above:

Figure 2: Convolutional Neural Network Architecture

Pooling Layer

The main idea of the pooling layer is to capture location and scaling invariance in the input data.
• Max pooling: take the max of the values in each block
• Average pooling: take the average of the values in each block

Figure 3: Example of Max Pooling

Similar to convolution, we can control the following parameters of the pooling layer:
• Pooling function: max, average, etc.
• Filter size F
• Stride S (usually the same as the filter size F)
• Padding P

If we consider the 1D pattern-matching example from last lecture, one prediction function is to take the max of the final output vector Z. That gives us 1 regardless of where the pattern shows up, which captures the location invariance of the recognition task.

Output Layer

In the output layer we want to turn our tensor into a prediction representation; one such layer is the fully connected layer:

Figure 4: Fully Connected Output Layer

This is similar to the perceptron introduced in previous lectures, and we can have multiple hidden/fully connected layers before the final output.
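To make the output-shape formula and the pooling operation above concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture; the helper names conv_output_shape and max_pool2d are made up, and the pooling loop assumes a single 2D feature map):

    import numpy as np

    def conv_output_shape(W1, H1, F, P, S, K):
        """Output shape (W2, H2, D2) of a conv layer, per W2 = (W1 - F + 2P)/S + 1."""
        W2 = (W1 - F + 2 * P) // S + 1
        H2 = (H1 - F + 2 * P) // S + 1
        return W2, H2, K

    def max_pool2d(x, F=2, S=2):
        """Max pooling with an F x F window and stride S over a 2D array."""
        H, W = x.shape
        H_out, W_out = (H - F) // S + 1, (W - F) // S + 1
        out = np.empty((H_out, W_out))
        for i in range(H_out):
            for j in range(W_out):
                out[i, j] = x[i * S:i * S + F, j * S:j * S + F].max()
        return out

    # 32x32x3 input, K = 10 filters of size 5x5, padding 2, stride 1  ->  (32, 32, 10)
    print(conv_output_shape(W1=32, H1=32, F=5, P=2, S=1, K=10))

    # 4x4 feature map, 2x2 max pooling with stride 2  ->  2x2 output
    print(max_pool2d(np.arange(16.0).reshape(4, 4)))

Average pooling would simply replace .max() with .mean() in the inner loop.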
Practical Tips for CNNs (in Computer Vision)

• To train networks that are invariant to rotation, we can do data augmentation, producing new samples by rotating the original data. Similar tricks handle other kinds of invariance.
• For size invariance, we can randomly rescale patches of the original training data.

Demo

• ConvNetJS
• Effective RNN (for the later section)
• Char-RNN code (for the later section)

Recurrent Neural Networks (RNN)

An example of a character-level sequence prediction model using an RNN:

Figure 5: Character-Level Prediction RNN

For the input layer here we used a one-hot representation of our alphabet (since its size is relatively small); for word-level prediction, however, we might want to use a symbol/word embedding (i.e. word vector) representation instead of a one-hot vector.

We can formally define such a network as

y^{(t)} = g\big(V^{(t)\top} z^{(t)} + a^{(t)}\big)
z^{(t)} = g\big(U^{(t)\top} x^{(t)} + W^{(t)\top} z^{(t-1)} + b^{(t)}\big)

where g is the activation function, V^{(t)}, U^{(t)}, and W^{(t)} are weight matrices, and a^{(t)} and b^{(t)} are bias terms.

To ensure that such a model applies to sequences of varying length, we impose a weight-sharing rule across time t:

V^{(t)} = V, \quad U^{(t)} = U, \quad W^{(t)} = W, \quad a^{(t)} = a, \quad b^{(t)} = b \qquad \forall t

Computational Graph

Ignoring the bias terms for simplicity, the RNN has the following computational graph:

Figure 6: RNN Cell Computation

Backpropagation

If we look at Figure 5 and consider the loss on the output y^{(3)}, the gradient with respect to W is

\frac{\partial\,\mathrm{cost}(y^{(3)})}{\partial W}
  = \frac{\partial\,\mathrm{cost}(y^{(3)})}{\partial y^{(3)}}
    \cdot \frac{\partial y^{(3)}}{\partial z^{(3)}}
    \cdot \frac{\partial z^{(3)}}{\partial g^{(3)}}
    \cdot \frac{\partial g^{(3)}}{\partial a^{(3)}}
    \cdot \frac{\partial a^{(3)}}{\partial W}

where

a^{(3)} = U^{\top} x^{(3)} + W^{\top} z^{(2)} + b

As we can see in a^{(3)}, W appears directly and z^{(2)} itself depends on W through the earlier time steps. Unrolling this dependence over time, we can show that

\frac{\partial \ell^{(T)}}{\partial W} \propto |W|^{T} \left(\frac{\partial g^{(t)}}{\partial a^{(t)}}\right)^{T}

Since T can be very large, as T grows:
• if |W| < 1, then \frac{\partial \ell^{(T)}}{\partial W} \to 0, which is called the vanishing gradient problem;
• if |W| > 1, then \frac{\partial \ell^{(T)}}{\partial W} \to \infty, which is called the exploding gradient problem.
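To make the forward recurrence above concrete, here is a minimal NumPy sketch of an RNN forward pass with the shared weights U, W, V, a, b reused at every time step (my own sketch, not code from the lecture; it assumes g = tanh and a zero initial state z^{(0)} = 0):

    import numpy as np

    def rnn_forward(xs, U, W, V, a, b, g=np.tanh):
        """Run the RNN forward over a sequence xs = [x_1, ..., x_T].

        The same U, W, V, a, b are reused at every step (weight sharing),
        which is what lets the model handle sequences of any length.
        """
        z = np.zeros(W.shape[0])          # z^{(0)}: initial hidden state
        ys, zs = [], []
        for x in xs:
            z = g(U.T @ x + W.T @ z + b)  # z^{(t)} = g(U^T x^{(t)} + W^T z^{(t-1)} + b)
            y = g(V.T @ z + a)            # y^{(t)} = g(V^T z^{(t)} + a)
            zs.append(z)
            ys.append(y)
        return ys, zs

    # Tiny example: one-hot inputs over a 4-character alphabet, hidden size 8.
    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 4, 8, 4
    U = rng.normal(scale=0.1, size=(d_in, d_hid))
    W = rng.normal(scale=0.1, size=(d_hid, d_hid))
    V = rng.normal(scale=0.1, size=(d_hid, d_out))
    a, b = np.zeros(d_out), np.zeros(d_hid)

    xs = [np.eye(d_in)[c] for c in [0, 2, 1, 3]]   # a length-4 one-hot sequence
    ys, zs = rnn_forward(xs, U, W, V, a, b)
    print(len(ys), ys[0].shape)                    # 4 outputs, each of size 4

In the character-level model of Figure 5 the output activation would usually be a softmax over the alphabet rather than g itself; g is kept generic here to match the equations above.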


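Building on that sketch, here is a tiny scalar illustration (again my own, under the simplifying assumption of a linear activation) of the |W|^T factor in the gradient: backpropagating through T time steps multiplies the gradient by the recurrent weight w at every step.

    # With a scalar recurrent weight w and a linear activation, each step of
    # backpropagation through time multiplies the gradient by w, so after
    # T steps the gradient is scaled by w**T.
    for w in (0.9, 1.0, 1.1):
        grad = 1.0
        for _ in range(100):          # T = 100 time steps
            grad *= w
        print(f"w = {w}: gradient factor after 100 steps = {grad:.3e}")
    # w = 0.9 -> ~2.7e-05 (vanishing); w = 1.1 -> ~1.4e+04 (exploding)

In practice, exploding gradients are commonly handled by gradient clipping, while vanishing gradients motivate gated architectures such as the LSTM discussed next lecture.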
