CS 188 Fall 2019 Exam Prep 11 Solutions

Q1. Perceptron

We would like to use a perceptron to train a classifier for datasets with 2 features per point and labels +1 or -1. Consider the following labeled training data:

  Features (x1, x2)   Label y*
  (-1, 2)             +1
  (3, -1)             -1
  (1, 2)              -1
  (3, 1)              +1

(a) Our two perceptron weights have been initialized to w1 = 2 and w2 = -2. After processing the first point with the perceptron algorithm, what will be the updated values for these weights?

For the first point, y = g(w1*x1 + w2*x2) = g(2*(-1) + (-2)*2) = g(-6) = -1, which is incorrectly classified. To update the weights, we add the first data point scaled by its label y* = +1: w1 = 2 + (-1) = 1 and w2 = -2 + 2 = 0.

(b) After how many steps will the perceptron algorithm converge? Write "never" if it will never converge. Note: one step means processing one point. Points are processed in order and then repeated, until convergence.

The data is not separable, so the algorithm will never converge.

(c) Instead of the standard perceptron algorithm, we decide to treat the perceptron as a single-node neural network and update the weights using gradient descent on the loss function. The loss function for one data point is Loss(y, y*) = (y - y*)^2, where y* is the training label for a given point and y is the output of our single-node network for that point.

(i) Given a general activation function g(z) and its derivative g'(z), what is the derivative of the loss function with respect to w1, in terms of g, g', y*, x1, x2, w1, and w2?

    ∂Loss/∂w1 = 2(g(w1*x1 + w2*x2) - y*) * g'(w1*x1 + w2*x2) * x1

(ii) For this question, the specific activation function that we will use is g(z) = 1 if z >= 0, and -1 if z < 0. Consider the following gradient descent equation for updating the weights given a single data point.
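Before continuing, the perceptron updates from parts (a) and (b) can be checked with a short simulation. This is a sketch, not part of the exam: the `g` and `step` helpers and the 1000-pass cutoff are my own choices, and the perceptron here has no bias term, matching the question.

```python
DATA = [((-1, 2), 1), ((3, -1), -1), ((1, 2), -1), ((3, 1), 1)]

def g(z):
    # Sign activation: +1 for z >= 0, else -1.
    return 1 if z >= 0 else -1

def step(w, x, y_star):
    # Process one point: update the weights only on a misclassification.
    if g(w[0] * x[0] + w[1] * x[1]) != y_star:
        w = (w[0] + y_star * x[0], w[1] + y_star * x[1])
    return w

# Part (a): the first point flips the weights from (2, -2) to (1, 0).
w = step((2, -2), *DATA[0])
print(w)  # (1, 0)

# Part (b): the data is not linearly separable (through the origin), so no
# full pass is ever mistake-free and the algorithm never converges.
converged = False
for _ in range(1000):  # 1000 passes = 4000 steps
    if all(g(w[0] * x[0] + w[1] * x[1]) == y for x, y in DATA):
        converged = True
        break
    for x, y in DATA:
        w = step(w, x, y)
print(converged)  # False
```

Increasing the pass count changes nothing: since no weight vector classifies all four points correctly, every pass contains at least one update.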
With initial weights of w1 = 2 and w2 = -2, what are the updated weights after processing the first point? Gradient descent update equation: wi ← wi - α * ∂Loss/∂wi.

Because the gradient of g is zero (wherever it is defined), the weights will stay w1 = 2 and w2 = -2.

(iii) What is the most critical problem with this gradient descent training process with that activation function?

The gradient of that activation function is zero everywhere (and undefined at z = 0), so the weights will never update, no matter how many points are misclassified.

Q2. Neural Nets

Consider the following computation graph for a simple neural network for binary classification. Here x is a single real-valued input feature with an associated class y* (0 or 1). There are two weight parameters, w1 and w2, and non-linearity functions g1 and g2 (to be defined later, below). The network will output a value a2 between 0 and 1, representing the probability of being in class 1. We will be using a loss function Loss (to be defined later, below) to compare the prediction a2 with the true class y*.

  x --(* w1)--> z1 --g1--> a1 --(* w2)--> z2 --g2--> a2 --> Loss(a2, y*)

1. Perform the forward pass on this network, writing the output values for each node z1, a1, z2, and a2 in terms of the node's input values:

    z1 = x * w1
    a1 = g1(z1)
    z2 = a1 * w2
    a2 = g2(z2)

2. Compute the loss Loss(a2, y*) in terms of the input x, weights wi, and activation functions gi. Recursively substituting the values computed above, we have:

    Loss(a2, y*) = Loss(g2(w2 * g1(w1 * x)), y*)

3. Now we will work through parts of the backward pass, incrementally. Use the chain rule to derive ∂Loss/∂w2. Write your expression as a product of partial derivatives at each node, i.e. the partial derivative of the node's output with respect to its inputs. (Hint: the series of expressions you wrote in part 1 will be helpful; you may use any of those variables.)

    ∂Loss/∂w2 = (∂Loss/∂a2) * (∂a2/∂z2) * (∂z2/∂w2)
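The forward pass from part 1 translates directly into code. This sketch uses the sigmoid for both g1 and g2 (anticipating part 4's choice), and the sample values x = 2.0, w1 = 0.5, w2 = -1.0 are my own illustration, not from the exam:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    # One value per node of the computation graph, as in part 1.
    z1 = x * w1          # z1 = x * w1
    a1 = sigmoid(z1)     # a1 = g1(z1)
    z2 = a1 * w2         # z2 = a1 * w2
    a2 = sigmoid(z2)     # a2 = g2(z2)
    return z1, a1, z2, a2

z1, a1, z2, a2 = forward(2.0, 0.5, -1.0)
print(round(a1, 4), round(a2, 4))  # 0.7311 0.325
```

Caching z1, a1, z2, a2 during the forward pass is what lets the backward pass below reuse them instead of recomputing.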
4. Suppose the loss function is quadratic, Loss(a2, y*) = (1/2)(a2 - y*)^2, and g1 and g2 are both sigmoid functions g(z) = 1 / (1 + e^(-z)) (note: it's typically better to use a different type of loss, cross-entropy, for classification problems, but we'll use this to make the math easier).

Using the chain rule from part 3, and the fact that ∂g(z)/∂z = g(z)(1 - g(z)) for the sigmoid function, write ∂Loss/∂w2 in terms of the values from the forward pass, y*, a1, and a2.

First we'll compute the partial derivatives at each node:

    ∂Loss/∂a2 = (a2 - y*)
    ∂a2/∂z2 = ∂g2(z2)/∂z2 = g2(z2)(1 - g2(z2)) = a2(1 - a2)
    ∂z2/∂w2 = a1

Now we can plug into the chain rule from part 3:

    ∂Loss/∂w2 = (∂Loss/∂a2)(∂a2/∂z2)(∂z2/∂w2) = (a2 - y*) * a2(1 - a2) * a1

5. Now use the chain rule to derive ∂Loss/∂w1 as a product of partial derivatives at each node used in the chain rule:

    ∂Loss/∂w1 = (∂Loss/∂a2)(∂a2/∂z2)(∂z2/∂a1)(∂a1/∂z1)(∂z1/∂w1)

6. Finally, write ∂Loss/∂w1 in terms of x, y*, wi, ai, zi. The partial derivatives at each node (in addition to the ones we computed in part 4) are:

    ∂z2/∂a1 = w2
    ∂a1/∂z1 = ∂g1(z1)/∂z1 = g1(z1)(1 - g1(z1)) = a1(1 - a1)
    ∂z1/∂w1 = x

Plugging into the chain rule from part 5 gives:

    ∂Loss/∂w1 = (∂Loss/∂a2)(∂a2/∂z2)(∂z2/∂a1)(∂a1/∂z1)(∂z1/∂w1) = (a2 - y*) * a2(1 - a2) * w2 * a1(1 - a1) * x

7. What is the gradient descent update for w1 with step-size α, in terms of the values computed above?

    w1 ← w1 - α(a2 - y*) * a2(1 - a2) * w2 * a1(1 - a1) * x
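Putting parts 4 through 7 together, the closed-form gradients can be verified numerically against central finite differences of the loss. This is a sketch; the test point x = 2.0, y* = 1, w1 = 0.5, w2 = -1.0 is my own arbitrary choice:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, y_star, w1, w2):
    # Part 2: Loss = 1/2 * (g2(w2 * g1(w1 * x)) - y*)^2
    a2 = sigmoid(sigmoid(x * w1) * w2)
    return 0.5 * (a2 - y_star) ** 2

def gradients(x, y_star, w1, w2):
    # Forward pass (part 1), then the closed-form gradients of parts 4 and 6.
    a1 = sigmoid(x * w1)
    a2 = sigmoid(a1 * w2)
    dw2 = (a2 - y_star) * a2 * (1 - a2) * a1
    dw1 = (a2 - y_star) * a2 * (1 - a2) * w2 * a1 * (1 - a1) * x
    return dw1, dw2

x, y_star, w1, w2 = 2.0, 1.0, 0.5, -1.0
dw1, dw2 = gradients(x, y_star, w1, w2)

# Sanity check against central finite differences.
eps = 1e-6
num_dw1 = (loss(x, y_star, w1 + eps, w2) - loss(x, y_star, w1 - eps, w2)) / (2 * eps)
num_dw2 = (loss(x, y_star, w1, w2 + eps) - loss(x, y_star, w1, w2 - eps)) / (2 * eps)
assert abs(dw1 - num_dw1) < 1e-7 and abs(dw2 - num_dw2) < 1e-7

# Part 7: one gradient descent step on w1, with step size alpha.
alpha = 0.1
w1 = w1 - alpha * dw1
```

This kind of finite-difference check is a standard way to catch sign and indexing mistakes in hand-derived backpropagation formulas.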
