CS446: Machine Learning, Fall 2017
Lecture 17: MLP Part II - Deep Feedforward Neural Networks
Lecturer: Sanmi Koyejo        Scribe: Shidi Zhao, Oct. 26th, 2017

Recap
• MLP
• loss functions
• training tips

1) Deep feedforward neural network

Figure 1: computation graph of a simple deep feedforward neural network

We can write out the units in each layer:

    z_{1,1} = g(w_{1,1}^T x)
    z_{k,i} = g(w_{k,i}^T z_{k-1})    (the i-th unit of layer k)

where g(·) is some nonlinearity applied to the product, for example ReLU or sigmoid. Stacking the units of layer k into a vector,

    z_k = \begin{bmatrix} z_{k,1} \\ z_{k,2} \\ \vdots \\ z_{k,m} \end{bmatrix}
        = g\left( \begin{bmatrix} \text{---}\, w_{k,1}^T \,\text{---} \\ \vdots \\ \text{---}\, w_{k,m}^T \,\text{---} \end{bmatrix} z_{k-1} \right).

Here z_k denotes the arbitrary layer we are looking at.

2) Loss functions

• Binary
  The output layer:
      z_l = \sigma(w_l^T z_{l-1})
  (we do not write the bias, to keep things simple). The loss function:
      L(y_i, f(x_i)) = -\left[ y_i \log f(x_i) + (1 - y_i) \log(1 - f(x_i)) \right] = \log\left(1 + e^{-y_i w_l^T z_{l-1}}\right),
  where the second form uses labels y_i \in \{-1, +1\} and f(x_i) = \sigma(w_l^T z_{l-1}). This loss is also called the log loss or binary cross-entropy.

• Alternative binary
  The output layer (linear function):
      z_l = f(x) = w^T z_{l-1} + b_{l-1}
  The loss function (hinge loss):
      L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))
  The first two examples solve binary classification problems: the "binary" method uses the log loss with a sigmoid output, while the "alternative binary" method applies the hinge loss directly to a linear output.

• Multiclass classification
  Here assume we are trying to predict one of k labels, so y \in \{1, \dots, k\}. The output layer maps the vector to "probabilities" (softmax):
      z_{l,j} = \frac{e^{w_j^T z_{l-1}}}{\sum_{i=1}^{k} e^{w_i^T z_{l-1}}} \approx P(y_i = j),  with  \sum_{j=1}^{k} z_{l,j} = 1.
  The loss function:
      L(y_i, f(x_i)) = -\sum_{j=1}^{k} y_{ij} \log f(x_i)_j
  This loss is also called the discrete cross-entropy. For each possible class j it checks whether y_i is in that class or not and scores the match by \log f(x_i)_j; thus f(x_i)_j should be highest for the true class and smaller for the others. The vector y_i here is a "one-hot" encoding.

• Alternative classification
  This is similar to the alternative binary case; we just have more labels. The output layer (linear function):
      z_l = f(x) = w^T z_{l-1} + b_{l-1}
  The loss function (hinge loss):
      L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))

• Regression (y \in R^k)
  The output layer: f(x_i) is linear. The loss function (squared error):
      L(y_i, f(x_i)) = \| y_i - f(x_i) \|_2^2

• Multilabel (y \in \{0, 1\}^k)
  The output layer: f(x) is a sigmoid, applied elementwise. The loss function:
      L(y_i, f(x_i)) = \sum_{k} [\text{binary cross-entropy for label } k]
  Here we treat each label as a separate binary classification and sum over all the labels.

Optimization
• Stochastic gradient descent (SGD)
• Mini-batch gradient descent. Advantages:
  1) it reduces the gradient variance compared to standard (single-sample) SGD;
  2) it is faster per step than full-batch gradient descent.
• Adam (extension of SGD, improves convergence)
• RMSprop (extension of SGD, improves convergence)
• Function graph

Figure 2: function graph
Figure 3: local function graph for region (a)
Figure 4: local function graph for region (b)
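Before moving on to regularization, here is a minimal NumPy sketch tying together the layer equations from Section 1 and the softmax cross-entropy loss from Section 2. This is my own illustration rather than code from the lecture: the function names (relu, forward, cross_entropy) and the toy layer sizes are made up, and biases are omitted as in the notes.

import numpy as np

def relu(a):
    # nonlinearity g(a) = max(0, a), applied elementwise
    return np.maximum(0.0, a)

def forward(x, weights):
    # weights[k] is the matrix whose rows are w_{k,1}^T, ..., w_{k,m}^T;
    # each hidden layer computes z_k = g(W_k z_{k-1}), with z_0 = x
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    # output layer: softmax maps the linear scores to "probabilities"
    scores = weights[-1] @ z
    scores = scores - scores.max()             # subtract max for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()  # z_{l,j} = e^{w_j^T z_{l-1}} / sum_i e^{w_i^T z_{l-1}}
    return p

def cross_entropy(y_onehot, p):
    # discrete cross-entropy: L = -sum_j y_{ij} log f(x_i)_j
    return -np.sum(y_onehot * np.log(p + 1e-12))

# toy usage: 3 inputs -> 4 hidden units -> 2 classes
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((4, 3)),    # W_1
           0.01 * rng.standard_normal((2, 4))]    # W_2 (output layer)
x = rng.standard_normal(3)
y = np.array([1.0, 0.0])                          # one-hot label
print(cross_entropy(y, forward(x, weights)))

The binary case in the notes follows the same pattern with a single sigmoid output and the log loss.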
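The optimization methods listed above all build on the same basic update. The sketch below is my own illustration of one epoch of mini-batch SGD, assuming a hypothetical user-supplied grad_loss(w, X_batch, Y_batch) that returns the gradient of the average loss over the batch; nothing here is prescribed by the lecture.

import numpy as np

def minibatch_sgd_epoch(w, X, Y, grad_loss, lr=0.1, batch_size=32, rng=None):
    # one pass over the data in shuffled mini-batches;
    # averaging the gradient over a batch reduces its variance relative to
    # single-sample SGD, while each step stays cheaper than full-batch descent
    rng = rng if rng is not None else np.random.default_rng()
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        w = w - lr * grad_loss(w, X[idx], Y[idx])
    return w

Adam and RMSprop keep this structure but replace the constant step size lr with per-parameter adaptive step sizes built from running averages of past gradients.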
Regularization

1) L2 regularization (Gaussian prior), where f(x_i) represents the whole neural network:
       \sum_{i=1}^{n} L(y_i, f(x_i)) + \sum_{l} \lambda_l \| w_l \|_2^2
2) L1 regularization (lasso)
3) L_\infty / max-norm: either constrain \max(w_l) \le c_l, or add the penalty \sum_{l} \lambda_l \| w_l \|_\infty to the loss function.
4) Dropout: "dropout" means randomly dropping some of the edges w; each is kept with probability p. For a layer z_{m,l} we draw a mask U_{m,l} \sim \mathrm{Bernoulli}(p) and compute the elementwise product
       z_{m,l} \circ U_{m,l}.
   Written this way, dropout does not change the form of the backward propagation of errors; you need to apply the same mask in both the forward and the backward step.
   Advantages:
   • it makes the network more "robust" to noise;
   • it reduces overfitting by increasing bias.
   Problem: at test time we use the full network and ignore the random dropout. The expected value of a masked unit is
       E[z] = p z + (1 - p) \cdot 0 = p z,
   so the expected activation is scaled by the factor p.
   Inverted dropout: to solve this problem we scale the mask U_{m,l} first; this is called inverted dropout:
       z_{m,l} \circ (U_{m,l} / p)  \Rightarrow  E[z] = z.
   (A NumPy sketch of both variants appears at the end of these notes.)

Practical tips

1) Initialize w as 0.01 \cdot N(0, 1).
   Question: why is initializing to 0 a bad idea?
   • Because the nodes in the same layer all have the same value (and the same gradients), so it takes a while to separate them later.
   • Depending on the choice of g(x), the weights may not be able to escape from 0 (e.g. ReLU).
2) Standardize your input: per dimension, make x have mean 0 and variance 1.
3) Standardize the layers (batch normalization).
4) Don't train the neural network from scratch; instead, take a network that is already trained, chop off its last layer, treat the remaining network as a feature map \phi(x), and train only the final layer.
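A short NumPy sketch of practical tips 1) and 2) above, again my own illustration with made-up helper names: small random initialization of the weights and per-dimension standardization of the inputs.

import numpy as np

rng = np.random.default_rng(0)

def init_weights(layer_sizes):
    # tip 1: initialize w as 0.01 * N(0, 1); an all-zero initialization would
    # leave every unit in a layer with identical values and gradients
    return [0.01 * rng.standard_normal((n_out, n_in))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def standardize(X, eps=1e-8):
    # tip 2: per dimension, shift to mean 0 and scale to variance 1
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
Xs = standardize(X)
print(Xs.mean(axis=0).round(3))   # ~0 in every dimension
print(Xs.std(axis=0).round(3))    # ~1 in every dimension
W = init_weights([10, 32, 3])     # weights for a 10 -> 32 -> 3 network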
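Tip 4 can be sketched as follows. This is my own illustration: W_pre is a stand-in for the hidden-layer weights of an already-trained network, and the closed-form ridge fit is just one simple way to train the new final layer, not a method given in the lecture.

import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.standard_normal((16, 10))   # stand-in for pretrained hidden weights

def phi(X):
    # frozen feature map: the pretrained network with its last layer chopped off
    return np.maximum(0.0, X @ W_pre.T)

def train_final_layer(Phi, Y, lam=1e-3):
    # fit only the new output weights on the frozen features phi(x)
    # (ridge regression in closed form: (Phi^T Phi + lam I)^{-1} Phi^T Y)
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)

X = rng.standard_normal((200, 10))
Y = rng.standard_normal((200, 3))       # toy regression targets
W_out = train_final_layer(phi(X), Y)    # only the final layer is trained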
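Finally, the sketch referred to in the dropout discussion above, my own illustration in NumPy: it checks numerically that plain dropout scales the expected activation to p*z while inverted dropout keeps E[z] = z.

import numpy as np

def dropout(z, p, rng, inverted=True):
    # U_{m,l} ~ Bernoulli(p): each unit is kept with probability p
    U = (rng.random(z.shape) < p).astype(z.dtype)
    if inverted:
        # inverted dropout: scale by 1/p at training time so E[z] = z,
        # and use the full network unchanged at test time
        return z * U / p
    # plain dropout: E[z] = p*z + (1 - p)*0 = p*z, so test-time activations
    # would have to be scaled by p to match the training expectation
    return z * U

rng = np.random.default_rng(0)
z = np.ones(100_000)
print(dropout(z, p=0.8, rng=rng, inverted=False).mean())  # ~0.8 = p * z
print(dropout(z, p=0.8, rng=rng, inverted=True).mean())   # ~1.0 = z

Because no extra scaling is needed at test time, inverted dropout is the variant most implementations use.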

