# ILLINOIS CS 446 - 102417.1 (5 pages)

- Pages: 5
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017

## Lecture 17: Neural Networks with Non-Linear Activations

Lecturer: Sanmi Koyejo. Scribe: Justin Szaday (szaday2). Oct 24th, 2017.

### Announcements

- Please complete the midterm survey if you have not already.
- The details of the project will be out by the end of the week; half a class will be dedicated to discussing it next week.
- The project will need a lot of compute hours, and to help, Microsoft has donated some to the class. If you would like to learn more about their platform, Azure, they will be hosting a tutorial on it Wednesday, November 1st. There will be a Piazza post with details soon.

### The Basic Perceptron

Figure 1: A perceptron in graph form, with linear activations.

The goal of the perceptron is to estimate a function $f : \mathbb{R}^d \to \{-1, 1\}$ that makes good predictions given a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. To do this, we aim to minimize the loss

$$\ell(D_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)),$$

where, for the perceptron in particular, the function $f$ is a simple linear function of the form $f(x) = w^T x + b$, and the loss is given by

$$\ell(y_i, f(x_i)) = \max(0, -y_i f(x_i)).$$

This loss function linearly penalizes mistakes and is zero otherwise. Figure 1 shows a graphical representation of the perceptron; in the next section we extend this concept to more complex functions.

### The Multi-Layer Perceptron

Figure 2: A multi-layer perceptron in graph form, with linear activations.

The idea of the multi-layer perceptron stems from repeating the basic perceptron more than once: we repeat the basic perceptron multiple times to form a layer, feed the results into another layer, then feed those results into yet another layer, and so on. Figure 2 shows a simple two-layer example. The first layer consists of $k$ functions, each equivalent to the standard perceptron with its own weights and bias. The $k \times 1$ vector they form, called $z$, is then fed into the output layer, which produces the $h \times 1$ output vector that gets fed into the loss function during training. This can be thought of as a 2-layer neural network with one hidden layer and linear activations (we will explain what "activations" means in a later section). Hidden layers refer to the layers leading up to the output layer; their results form the vector $z$, given by

$$z(x_i) = \begin{bmatrix} w_{1,1}^T x_i + b_{1,1} \\ w_{1,2}^T x_i + b_{1,2} \\ \vdots \\ w_{1,k}^T x_i + b_{1,k} \end{bmatrix} = W_1 x_i + b_1 \tag{1}$$

Likewise, we can express $f$ as

$$f(x_i) = W_2 z + b_2 = W_2 (W_1 x_i + b_1) + b_2 \tag{2}$$

This shows that the multi-layer perceptron (with linear activations) is equivalent to a simple linear model, like linear regression. As such, we would like to expand it to include non-linearity.
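The perceptron loss $\max(0, -y_i f(x_i))$ above can be minimized with stochastic (sub)gradient steps, which update the weights only on misclassified points. The following is a minimal sketch of that procedure; the toy data, learning rate, and epoch count are illustrative choices, not from the notes.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """SGD on the perceptron loss max(0, -y * f(x)), with f(x) = w.T @ x + b."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Subgradient of max(0, -y*f(x)) is nonzero only when y_i * f(x_i) <= 0,
            # i.e., on a mistake; then we move w toward y_i * x_i.
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)
```

On separable data like this, the updates stop once every point satisfies $y_i f(x_i) > 0$, matching the zero-loss region of the hinge-style penalty.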

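Equation (2)'s collapse of a linear two-layer network into a single affine map can be checked numerically. This is a small sketch: the dimensions, random seed, and variable names are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, h = 4, 3, 2  # input, hidden, and output dimensions (arbitrary)
W1, b1 = rng.normal(size=(k, d)), rng.normal(size=k)
W2, b2 = rng.normal(size=(h, k)), rng.normal(size=h)
x = rng.normal(size=d)

# Two-layer forward pass with linear activations, as in Eq. (2)
z = W1 @ x + b1
f_two_layer = W2 @ z + b2

# Equivalent single linear model: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
f_linear = W @ x + b

assert np.allclose(f_two_layer, f_linear)
```

Since the composition of affine maps is affine, no amount of extra linear layers adds expressive power; this is the motivation for the non-linear activations discussed next.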