CS446: Machine Learning, Fall 2017
Lecture 17: MLP II
Lecturer: Sanmi Koyejo        Scribe: Karthik Bala, Oct. 26th, 2017

In this lecture, we finished discussing multilayer perceptrons, determined appropriate loss functions for different types of classification, and discussed dropout.

Recap - Multilayer Perceptrons

Multilayer perceptrons (also known as "deep feedforward neural networks") can be drawn as a layered graph of input, hidden, and output nodes (figure omitted). Each node in a hidden layer represents the nonlinear computation
\[ z_{i,k} = g(w_{i,k}^T z_{k-1}), \]
where $g$ is a nonlinear function. We can write the same computation for an entire layer as a matrix computation. Let $m$ be the number of nodes in the layer. Then layer $k$ of our feedforward NN can be written as
\[ z_k = \begin{pmatrix} z_{k,1} \\ z_{k,2} \\ \vdots \\ z_{k,m} \end{pmatrix} = g\!\left( \begin{pmatrix} w_{k,1}^T \\ w_{k,2}^T \\ \vdots \\ w_{k,m}^T \end{pmatrix} \begin{pmatrix} z_{k-1,1} \\ z_{k-1,2} \\ \vdots \\ z_{k-1,m} \end{pmatrix} \right). \]
Examples of nonlinear functions $g$ are the ReLU function and the sigmoid function.

Output Layers, Loss Functions

Consider binary classification: $y \in \{-1, 1\}$. We can write the output $z_l$ in terms of all the nodes that enter it:
\[ z_l = \sigma(w_l^T z_{l-1} + \text{bias}) = f(x). \]
Given a choice of sigmoid for the output, what is a reasonable loss function? With the labels recoded as $y_i \in \{0,1\}$, the negative log-likelihood is
\[ \ell(y_i, f(x_i)) = -\big[\, y_i \log f(x_i) + (1 - y_i)\log(1 - f(x_i)) \,\big] = \log\!\left(1 + e^{-y_i\, w_l^T z_{l-1}}\right), \]
where the second form uses $y_i \in \{-1, 1\}$. This loss is referred to as the "log-loss", "binary cross entropy", or "Bernoulli log likelihood". Now consider the case in which $z_l$ is linear:
\[ z_l = f(x) = w_l^T z_{l-1} + b_{l-1}. \]
An alternative loss function for binary classification is then the hinge loss:
\[ \ell(y_i, f(x_i)) = \max(0,\, 1 - y_i f(x_i)). \]

Multiclass Classification

Multiclass classification is defined by $y \in \{1, \dots, k\}$. Here we introduce the "softmax" function:
\[ z_{l,j} = \frac{e^{w_j^T z_{l-1}}}{\sum_{i=1}^{k} e^{w_i^T z_{l-1}}} = \Pr(y = j), \quad j \in \{1, \dots, k\}, \]
and thus
\[ \sum_{j=1}^{k} z_{l,j} = 1. \]
This is essentially a generalization of the sigmoid function to $k > 2$ classes. Note that $w_l \in \mathbb{R}^{k \times \dim(z_{l-1})}$, as we need a weight vector for each class.

The loss function is then
\[ \ell(y_i, f(x_i)) = -\sum_{j=1}^{k} y_{i,j} \log f(x_i)_j, \]
which is often called the discrete cross entropy. Note that here $y_i$ is stored in a one-hot encoding, so the loss is small when $f(x_i)_j$ is largest for the true class $j$.

As an alternative, if $f(x) = z_l$ (a linear activation), the loss function is the multiclass hinge loss.

Regression

Regression is defined by $y \in \mathbb{R}^k$. Recall that here $f(x)$ is linear. Then the loss function is
\[ \ell(y_i, f(x_i)) = \| y_i - f(x_i) \|_2^2, \]
and when $y \in \mathbb{R}$ this is simply $(y_i - f(x_i))^2$.

Multilabel Classification

Multilabel classification is defined by $y \in \{0,1\}^k$; each example may carry any subset of the $k$ labels. The last (output) layer is an elementwise sigmoid, and the loss $\ell(y, f)$ is the sum over the $k$ labels of the per-label binary cross entropy.

Optimization

Some optimization techniques we have discussed in class are stochastic gradient descent and minibatch SGD, which has the additional benefit of reduced variance compared to single-example SGD. Additional optimization techniques include Adam and RMSProp, both of which are extensions of SGD. Sanmi mentioned that both of these methods leverage the history of the gradient and therefore require additional memory.
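The following is a minimal NumPy sketch, not taken from the lecture, that ties the pieces above together: a one-hidden-layer MLP with a ReLU hidden layer and a softmax output, the discrete cross-entropy loss, and a single minibatch SGD step with hand-derived gradients. The layer sizes, learning rate, and all variable names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the lecture): forward pass, discrete
# cross entropy, and one minibatch SGD step for a one-hidden-layer MLP.
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y_onehot):
    # discrete cross entropy: -sum_j y_j log f(x)_j, averaged over the minibatch
    return -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))

# toy minibatch: n examples, d features, h hidden units, k classes (assumed sizes)
n, d, h, k = 32, 10, 16, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
Y = np.eye(k)[y]                                 # one-hot encoding of the labels

# small random initialization (see the practical tips later in these notes)
W1 = 0.01 * rng.normal(size=(d, h))
W2 = 0.01 * rng.normal(size=(h, k))

# forward pass
Z1 = relu(X @ W1)                                # hidden layer: z_1 = g(W_1 x)
P = softmax(Z1 @ W2)                             # output layer: class probabilities
print("loss before step:", cross_entropy(P, Y))

# backward pass: gradients of the cross entropy with respect to the weights
dlogits = (P - Y) / n                            # gradient through softmax + cross entropy
dW2 = Z1.T @ dlogits
dZ1 = dlogits @ W2.T
dZ1[Z1 <= 0] = 0.0                               # gradient of the ReLU
dW1 = X.T @ dZ1

# one minibatch SGD step
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
print("loss after step:", cross_entropy(softmax(relu(X @ W1) @ W2), Y))
```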
Function Graph

Consider the function graph of the network (figure omitted). Consider a node $a$, which sums the results of all the activation functions from the previous layer, multiplied by the weight vector, and passes its result to the next activation function. The computation done by node $a$ is written in the center of the figure; the inputs and outputs for the forward pass are drawn in blue, and the inputs and outputs for the backward pass are drawn in red. Now consider a node $b$, which represents the computation of an activation function (applied to the output of node $a$) and passes its output to all the nodes in the next layer.

Regularization

We discuss a few methods for regularization.

1. Consider $L_2$ regularization, or "Gaussian prior" regularization. Again we have a loss function, now with a penalty summed over the layers:
\[ \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \sum_{l} \lambda_l \| w_l \|_2^2. \]
The second term regularizes the weights (as in logistic or linear regression) and can be added as another node in backpropagation.

2. $L_1$ regularization was also mentioned, but not discussed.

3. The max norm ($L_\infty$ norm) is defined as $\| w \|_\infty = \max_d |w_d|$. It can act as either a constraint or a regularizer, that is, either
\[ \| w_l \|_\infty \le c_l, \]
or, added to some loss function $\ell$,
\[ \sum_{l} \lambda_l \| w_l \|_\infty. \]

Dropout

Hadamard Product

The Hadamard product (entrywise product) of two matrices $A$ and $B$ is written $A \circ B$.

Dropout

The dropout technique randomly keeps each node with probability $p$ (and drops it with probability $1-p$). It can be written as
\[ z_{m,l} \circ u_{m,l}, \]
where each entry of $u_{m,l}$ is an independent Bernoulli($p$) random variable, so each node is dropped independently and identically.

Conceptually, this is similar to random forests, with the difference that random forests randomize choices over the data (and sometimes the features); both techniques randomize the model structure. Dropout is easily combined with backpropagation. Additionally, it makes the network more "robust" to noise and reduces overfitting; however, it also increases the bias of the network. In practice, we use the full network (ignoring dropout) for prediction. However, in training the expected value of a node $z$ is
\[ \mathbb{E}[z] = p z + (1-p) \cdot 0 = p z, \]
which creates a problem: each node is scaled down by $p$. To resolve this, we must undo the effect of $p$ on our final prediction:
\[ \frac{\mathbb{E}[z]}{p} = z. \]
A more efficient way of achieving the same result is called inverted dropout.

Inverted Dropout

Here, we instead divide each kept node by $p$ during training, so that each node's expected value is unchanged. That is, we take
\[ \frac{z_{m,l} \circ u_{m,l}}{p}, \]
and thus $\mathbb{E}[z] = z$ for each node, so no rescaling is needed at prediction time (a short code sketch appears at the end of these notes).

Practical Tips

1. Scale the weights down by initializing them to $0.01 \cdot \mathcal{N}(0,1)$. Initializing the weights to zero is a bad idea: nodes in the same layer would have the same values and receive the same gradients, so they never differentiate and training is slowed. Furthermore, initializing the biases and weights to 0 while choosing ReLU for the activation function could result in an output of zero (and zero gradients). This is bad.

2. It is often useful to standardize the input, that is, each dimension of $x$ should have mean 0 and variance 1.

3. Standardize the inputs to each layer as well. This is called batch normalization.

4. Instead of training the network from scratch, use an already trained network, chop off the last layer, and train only the final layer. Conceptually, this can be thought of as using the pre-trained network as a feature extractor.
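Returning to the dropout discussion above, here is a short NumPy sketch, not from the lecture, of inverted dropout: each unit is kept with probability $p$ and the kept activations are divided by $p$ during training, so the expected activation is unchanged and the full network is used as-is at prediction time. The keep probability, array shapes, and function name are illustrative assumptions.

```python
# Illustrative sketch of inverted dropout (not from the lecture).
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(z, p, training=True):
    """Apply inverted dropout to activations z with keep probability p."""
    if not training:
        return z                           # prediction: full network, no rescaling needed
    u = rng.binomial(1, p, size=z.shape)   # u ~ Bernoulli(p): 1 = keep, 0 = drop
    return (z * u) / p                     # Hadamard product, divided by p so E[z] = z

z = rng.normal(size=(4, 5))                # activations of one layer for a small minibatch
z_train = inverted_dropout(z, p=0.8)       # randomly zeroed, kept entries scaled by 1/0.8
z_test = inverted_dropout(z, p=0.8, training=False)  # unchanged at prediction time
print(z_train)
print(np.allclose(z_test, z))              # True
```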

