CS446: Machine Learning, Fall 2017
Lecture 17: MLP Part II - Deep Feedforward Neural Networks
Lecturer: Sanmi Koyejo        Scribe: Shidi Zhao, Oct. 26th, 2017

Recap
• MLP
• loss functions
• training tips

1) Deep feedforward neural network

Figure 1: computation graph of a simple deep feedforward neural network

We can write out the units in each layer:

    z_{1,1} = g(w_{1,1}^T x)
    z_{k,i} = g(w_{k,i}^T z_{k-1})    (the i-th unit of layer k)

where g(·) is some nonlinearity applied to the product, for example ReLU or sigmoid. Stacking the units of layer k into a vector,

    z_k = \begin{bmatrix} z_{k,1} \\ z_{k,2} \\ \vdots \\ z_{k,m} \end{bmatrix}
        = g\left( \begin{bmatrix} \text{---}\, w_{k,1}^T \,\text{---} \\ \vdots \\ \text{---}\, w_{k,m}^T \,\text{---} \end{bmatrix} z_{k-1} \right).

Here z_k denotes the arbitrary layer we are looking at.

2) Loss functions

• Binary
  The output layer:
      z_l = \sigma(w_l^T z_{l-1})
  (we do not write the bias, to keep things simple). The loss function:
      L(y_i, f(x_i)) = -\left[ y_i \log f(x_i) + (1 - y_i) \log(1 - f(x_i)) \right] = \log\left(1 + e^{-y_i w_l^T z_{l-1}}\right),
  where the second form uses labels y_i \in \{-1, +1\} and f(x_i) = \sigma(w_l^T z_{l-1}). This loss is also called the log loss or binary cross-entropy.

• Alternative binary
  The output layer (linear function):
      z_l = f(x) = w^T z_{l-1} + b_{l-1}
  The loss function (hinge loss):
      L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))
  The first two examples solve binary classification problems: the "binary" method uses the log loss with a sigmoid output, while the "alternative binary" method applies the hinge loss directly to a linear output.

• Multiclass classification
  Here assume we are trying to predict one of k labels, so y \in \{1, \dots, k\}. The output layer maps the vector to "probabilities" (softmax):
      z_{l,j} = \frac{e^{w_j^T z_{l-1}}}{\sum_{i=1}^{k} e^{w_i^T z_{l-1}}} \approx P(y_i = j),  with  \sum_{j=1}^{k} z_{l,j} = 1.
  The loss function:
      L(y_i, f(x_i)) = -\sum_{j=1}^{k} y_{ij} \log f(x_i)_j
  This loss is also called the discrete cross-entropy. For each possible class j it checks whether y_i is in that class or not and scores the match by \log f(x_i)_j; thus f(x_i)_j should be highest for the true class and smaller for the others. The vector y_i here is a "one-hot" encoding.

• Alternative classification
  This is similar to the alternative binary case; we just have more labels. The output layer (linear function):
      z_l = f(x) = w^T z_{l-1} + b_{l-1}
  The loss function (hinge loss):
      L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))

• Regression (y \in R^k)
  The output layer: f(x_i) is linear. The loss function (squared error):
      L(y_i, f(x_i)) = \| y_i - f(x_i) \|_2^2

• Multilabel (y \in \{0, 1\}^k)
  The output layer: f(x) is a sigmoid, applied elementwise. The loss function:
      L(y_i, f(x_i)) = \sum_{k} [\text{binary cross-entropy for label } k]
  Here we treat each label as a separate binary classification and sum over all the labels.

Optimization
• Stochastic gradient descent (SGD)
• Mini-batch gradient descent. Advantages:
  1) it reduces the gradient variance compared to standard (single-sample) SGD;
  2) it is faster per step than full-batch gradient descent.
• Adam (extension of SGD, improves convergence)
• RMSprop (extension of SGD, improves convergence)
• Function graph

Figure 2: function graph
Figure 3: local function graph for region (a)
Figure 4: local function graph for region (b)
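Before moving on to regularization, here is a minimal NumPy sketch tying together the layer equations from Section 1 and the softmax cross-entropy loss from Section 2. This is my own illustration rather than code from the lecture: the function names (relu, forward, cross_entropy) and the toy layer sizes are made up, and biases are omitted as in the notes.

import numpy as np

def relu(a):
    # nonlinearity g(a) = max(0, a), applied elementwise
    return np.maximum(0.0, a)

def forward(x, weights):
    # weights[k] is the matrix whose rows are w_{k,1}^T, ..., w_{k,m}^T;
    # each hidden layer computes z_k = g(W_k z_{k-1}), with z_0 = x
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    # output layer: softmax maps the linear scores to "probabilities"
    scores = weights[-1] @ z
    scores = scores - scores.max()             # subtract max for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()  # z_{l,j} = e^{w_j^T z_{l-1}} / sum_i e^{w_i^T z_{l-1}}
    return p

def cross_entropy(y_onehot, p):
    # discrete cross-entropy: L = -sum_j y_{ij} log f(x_i)_j
    return -np.sum(y_onehot * np.log(p + 1e-12))

# toy usage: 3 inputs -> 4 hidden units -> 2 classes
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((4, 3)),    # W_1
           0.01 * rng.standard_normal((2, 4))]    # W_2 (output layer)
x = rng.standard_normal(3)
y = np.array([1.0, 0.0])                          # one-hot label
print(cross_entropy(y, forward(x, weights)))

The binary case in the notes follows the same pattern with a single sigmoid output and the log loss.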
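The optimization methods listed above all build on the same basic update. The sketch below is my own illustration of one epoch of mini-batch SGD, assuming a hypothetical user-supplied grad_loss(w, X_batch, Y_batch) that returns the gradient of the average loss over the batch; nothing here is prescribed by the lecture.

import numpy as np

def minibatch_sgd_epoch(w, X, Y, grad_loss, lr=0.1, batch_size=32, rng=None):
    # one pass over the data in shuffled mini-batches;
    # averaging the gradient over a batch reduces its variance relative to
    # single-sample SGD, while each step stays cheaper than full-batch descent
    rng = rng if rng is not None else np.random.default_rng()
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        w = w - lr * grad_loss(w, X[idx], Y[idx])
    return w

Adam and RMSprop keep this structure but replace the constant step size lr with per-parameter adaptive step sizes built from running averages of past gradients.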
Regularization

1) L2 regularization (Gaussian prior), where f(x_i) represents the whole neural network:
       \sum_{i=1}^{n} L(y_i, f(x_i)) + \sum_{l} \lambda_l \| w_l \|_2^2
2) L1 regularization (lasso)
3) L_\infty / max-norm: either constrain \max(w_l) \le c_l, or add the penalty \sum_{l} \lambda_l \| w_l \|_\infty to the loss function.
4) Dropout: "dropout" means randomly dropping some of the edges w; each is kept with probability p. For a layer z_{m,l} we draw a mask U_{m,l} \sim \mathrm{Bernoulli}(p) and compute the elementwise product
       z_{m,l} \circ U_{m,l}.
   Written this way, dropout does not change the form of the backward propagation of errors; you need to apply the same mask in both the forward and the backward step.
   Advantages:
   • it makes the network more "robust" to noise;
   • it reduces overfitting by increasing bias.
   Problem: at test time we use the full network and ignore the random dropout. The expected value of a masked unit is
       E[z] = p z + (1 - p) \cdot 0 = p z,
   so the expected activation is scaled by the factor p.
   Inverted dropout: to solve this problem we scale the mask U_{m,l} first; this is called inverted dropout:
       z_{m,l} \circ (U_{m,l} / p)  \Rightarrow  E[z] = z.
   (A NumPy sketch of both variants appears at the end of these notes.)

Practical tips

1) Initialize w as 0.01 \cdot N(0, 1).
   Question: why is initializing to 0 a bad idea?
   • Because the nodes in the same layer all have the same value (and the same gradients), so it takes a while to separate them later.
   • Depending on the choice of g(x), the weights may not be able to escape from 0 (e.g. ReLU).
2) Standardize your input: per dimension, make x have mean 0 and variance 1.
3) Standardize the layers (batch normalization).
4) Don't train the neural network from scratch; instead, take a network that is already trained, chop off its last layer, treat the remaining network as a feature map \phi(x), and train only the final layer.
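A short NumPy sketch of practical tips 1) and 2) above, again my own illustration with made-up helper names: small random initialization of the weights and per-dimension standardization of the inputs.

import numpy as np

rng = np.random.default_rng(0)

def init_weights(layer_sizes):
    # tip 1: initialize w as 0.01 * N(0, 1); an all-zero initialization would
    # leave every unit in a layer with identical values and gradients
    return [0.01 * rng.standard_normal((n_out, n_in))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def standardize(X, eps=1e-8):
    # tip 2: per dimension, shift to mean 0 and scale to variance 1
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
Xs = standardize(X)
print(Xs.mean(axis=0).round(3))   # ~0 in every dimension
print(Xs.std(axis=0).round(3))    # ~1 in every dimension
W = init_weights([10, 32, 3])     # weights for a 10 -> 32 -> 3 network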
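Tip 4 can be sketched as follows. This is my own illustration: W_pre is a stand-in for the hidden-layer weights of an already-trained network, and the closed-form ridge fit is just one simple way to train the new final layer, not a method given in the lecture.

import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.standard_normal((16, 10))   # stand-in for pretrained hidden weights

def phi(X):
    # frozen feature map: the pretrained network with its last layer chopped off
    return np.maximum(0.0, X @ W_pre.T)

def train_final_layer(Phi, Y, lam=1e-3):
    # fit only the new output weights on the frozen features phi(x)
    # (ridge regression in closed form: (Phi^T Phi + lam I)^{-1} Phi^T Y)
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)

X = rng.standard_normal((200, 10))
Y = rng.standard_normal((200, 3))       # toy regression targets
W_out = train_final_layer(phi(X), Y)    # only the final layer is trained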
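Finally, the sketch referred to in the dropout discussion above, my own illustration in NumPy: it checks numerically that plain dropout scales the expected activation to p*z while inverted dropout keeps E[z] = z.

import numpy as np

def dropout(z, p, rng, inverted=True):
    # U_{m,l} ~ Bernoulli(p): each unit is kept with probability p
    U = (rng.random(z.shape) < p).astype(z.dtype)
    if inverted:
        # inverted dropout: scale by 1/p at training time so E[z] = z,
        # and use the full network unchanged at test time
        return z * U / p
    # plain dropout: E[z] = p*z + (1 - p)*0 = p*z, so test-time activations
    # would have to be scaled by p to match the training expectation
    return z * U

rng = np.random.default_rng(0)
z = np.ones(100_000)
print(dropout(z, p=0.8, rng=rng, inverted=False).mean())  # ~0.8 = p * z
print(dropout(z, p=0.8, rng=rng, inverted=True).mean())   # ~1.0 = z

Because no extra scaling is needed at test time, inverted dropout is the variant most implementations use.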

