
**Problem Set 1**
**CS 6375**
Due: 2/10/2022 by 11:59pm

Note: all answers should be accompanied by explanations for full credit. All code used as part of your solutions should be included for partial credit. Late homeworks will not be accepted.

**Warm-Up: Subgradients & More (15 pts)**

1. Recall that a function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and $\lambda \in [0, 1]$, $\lambda f(x) + (1 - \lambda) f(y) \geq f(\lambda x + (1 - \lambda) y)$. Using this definition, show that

   (a) $f(x) = w f_1(x)$ is a convex function for $x \in \mathbb{R}^n$ whenever $f_1 : \mathbb{R}^n \to \mathbb{R}$ is a convex function and $w \geq 0$;

   (b) $f(x) = f_1(x) + f_2(x)$ is a convex function for $x \in \mathbb{R}^n$ whenever $f_1 : \mathbb{R}^n \to \mathbb{R}$ and $f_2 : \mathbb{R}^n \to \mathbb{R}$ are convex functions;

   (c) $f(x) = \max\{f_1(x), f_2(x)\}$ is a convex function for $x \in \mathbb{R}^n$ whenever $f_1 : \mathbb{R}^n \to \mathbb{R}$ and $f_2 : \mathbb{R}^n \to \mathbb{R}$ are convex functions.

2. Compute a subgradient at the specified points for each of the following convex functions.

   (a) $f(x) = \max\{x^2 - 2x, |x|\}$ at $x = 0$, $x = -2$, and $x = 1$.

   (b) $g(x) = \max\{(x - 1)^2, (x - 2)^2\}$ at $x = 1.5$ and $x = 0$.

**Problem 1: Perceptron Learning (45 pts)**

Consider the data set (perceptron.data) attached to this homework. This data file consists of $M$ data elements of the form $(x_1^{(m)}, x_2^{(m)}, x_3^{(m)}, x_4^{(m)}, y^{(m)})$ where $x_1^{(m)}, \ldots, x_4^{(m)} \in \mathbb{R}$ define a data point in $\mathbb{R}^4$ and $y^{(m)} \in \{-1, 1\}$ is the corresponding class label.

1. In class, we saw how to use the perceptron algorithm to minimize the following loss function.
   $$\frac{1}{M} \sum_{m=1}^{M} \max\{0,\; -y^{(m)} \cdot (w^T x^{(m)} + b)\}$$
   What is the smallest, in terms of number of data points, two-dimensional data set containing both class labels on which the perceptron algorithm, with step size one, fails to converge? Use this example to explain why the method may fail to converge more generally.
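The classroom update for this loss — take a subgradient step on each misclassified point — can be sketched as follows. This is a generic illustration, not an answer to the question above; the function name and step-size argument are assumptions:

```python
import numpy as np

def perceptron_epoch(X, y, w, b, lr=1.0):
    """One pass over the data: subgradient step on each misclassified point.

    A point is misclassified when y_m * (w . x_m + b) <= 0; the perceptron
    loss's subgradient step then moves (w, b) along (y_m * x_m, y_m).
    """
    for xm, ym in zip(X, y):
        if ym * (xm @ w + b) <= 0:
            w = w + lr * ym * xm
            b = b + lr * ym
    return w, b
```

On linearly separable data, repeated passes eventually make every margin $y^{(m)}(w^T x^{(m)} + b)$ positive; the question above asks for a smallest data set on which that never happens.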
2. Consider the following alternative loss function.
   $$\frac{1}{M} \sum_{m=1}^{M} \max\{0,\; 1 - y^{(m)} \cdot (w^T x^{(m)} + b)\}^2$$
   (a) If the data is linearly separable, does this loss function have any local optima that are not global optima?

   (b) For each optimization strategy below, report the number of iterations that it takes to find a perfect classifier for the data, the values of $w$ and $b$ for the first three iterations, and the final weights and bias. Each descent procedure should start from the initial point
   $$w_0 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \qquad b_0 = 0.$$
      i. Standard subgradient descent with the step size $\gamma_t = 1$ for each iteration.

      ii. Stochastic subgradient descent where exactly one component of the sum is chosen to approximate the gradient at each iteration. Instead of picking a random component at each iteration, you should iterate through the data set starting with the first element, then the second, and so on until the $M$-th element, at which point you should start back at the beginning again. Again, use the step size $\gamma_t = 1$.

      iii. How does the rate of convergence change as you change the step size? Provide some example step sizes to back up your statements.

   (c) Does your subgradient descent implementation with step size one always converge using this loss? Explain.

**Problem 2: Separability & Feature Vectors (15 pts)**

1. Consider the following data set.

   [Figure: scatter plot of the data points in the $(x_1, x_2)$ plane; both axes range from $-2$ to $2$.]

   Under which of the following feature vectors is the data linearly separable? For full credit, you must justify your answer by either providing a linear separator or explaining why such a separator does not exist.

   (a) $\phi(x_1, x_2) = \begin{bmatrix} x_1 + x_2 \\ x_1 - x_2 \end{bmatrix}$

   (b) $\phi(x_1, x_2) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$

   (c) $\phi(x_1, x_2) = \begin{bmatrix} \exp(x_1) \\ \exp(x_2) \end{bmatrix}$

   (d) $\phi(x_1, x_2) = \begin{bmatrix} x_1 \cdot \sin(x_2) \\ x_1 \end{bmatrix}$

2. Suppose that you wanted to perform polynomial regression for 2-dimensional data points using gradient descent, i.e., you want to fit a polynomial of degree $k$ to your data. Explain how to do this using feature vectors.
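One standard construction — sketched below; the helper name is an assumption — maps each 2-dimensional point to the vector of all monomials $x_1^i x_2^j$ with $0 < i + j \le k$, after which fitting the polynomial reduces to linear regression on the features. Excluding the constant term (absorbed by the bias), this representation has $(k+1)(k+2)/2 - 1$ coordinates, so it grows as $O(k^2)$:

```python
import numpy as np

def poly_features(x1, x2, k):
    """All monomials x1^i * x2^j with 0 < i + j <= k, in a fixed order.

    The constant monomial (i = j = 0) is omitted; it is usually handled
    by a separate bias term in the regression.
    """
    return np.array([x1 ** i * x2 ** j
                     for i in range(k + 1)
                     for j in range(k + 1 - i)
                     if i + j > 0])
```

For degree $k = 2$ this yields the five features $x_2, x_2^2, x_1, x_1 x_2, x_1^2$.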
   What is the per-iteration complexity of gradient descent as a function of the size of your feature representation and the number of training data points?

**Problem 3: Support Vector Machines (25 pts)**

For this problem, consider the data set (mystery.data) attached to this homework that, like Problem 1, contains four numeric attributes per row; the fifth entry is the class variable (either + or −). Find a perfect classifier for this data set using support vector machines. Your solution should explain the optimization problem that you solved and provide the learned parameters, the optimal margin, and the support vectors. Note, for full credit, your solution should only make use of a quadratic programming solver, i.e., you should not rely on existing SVM implementations.
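The hard-margin formulation this problem points at — minimize $\frac{1}{2}\|w\|^2$ subject to $y^{(m)}(w^T x^{(m)} + b) \ge 1$ for all $m$ — can be checked on toy data with any constrained solver. The sketch below illustrates the setup only, using SciPy's SLSQP method as a stand-in for a dedicated QP solver; the function name and return values are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Minimize 0.5*||w||^2 subject to y_m * (w . x_m + b) >= 1 for all m."""
    n = X.shape[1]
    objective = lambda z: 0.5 * np.dot(z[:n], z[:n])           # z = (w, b)
    constraints = [{"type": "ineq",                            # fun(z) >= 0
                    "fun": lambda z: y * (X @ z[:n] + z[n]) - 1.0}]
    res = minimize(objective, np.zeros(n + 1), constraints=constraints)
    w, b = res.x[:n], res.x[n]
    return w, b, 1.0 / np.linalg.norm(w)                       # margin = 1/||w||
```

The support vectors are the training points whose constraint is tight at the optimum, i.e., those with $y^{(m)}(w^T x^{(m)} + b) \approx 1$.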
