6.867 Machine Learning
Mid-term exam
October 13, 2004

(2 points) Your name and MIT ID:

Problem 1

[Three plots, labeled A, B, and C, each showing "noise" versus x over the range [−1, 1].]

1. (6 points) Each plot above claims to represent prediction errors as a function of x for a trained regression model based on some dataset. Some of these plots could potentially be prediction errors for linear or quadratic regression models, while others couldn’t. The regression models are trained with the least squares estimation criterion. Please indicate compatible models and plots.

                            A     B     C
   linear regression       ( )   ( )   ( )
   quadratic regression    ( )   ( )   ( )

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Problem 2

Here we explore a regression model where the noise variance is a function of the input (the variance increases as a function of the input). Specifically,

   y = wx + ε

where the noise ε is normally distributed with mean 0 and standard deviation σx. The value of σ is assumed known and the input x is restricted to the interval [1, 4]. We can write the model more compactly as y ∼ N(wx, σ²x²). If we let x vary within [1, 4] and sample outputs y from this model with some w, the regression plot might look like

[A scatter plot of sampled outputs y (from 0 to 10) against inputs x (from 1 to 4).]

1. (2 points) How is the ratio y/x distributed for a fixed (constant) x?

2. Suppose we now have n training points and targets {(x1, y1), (x2, y2), . . . , (xn, yn)}, where each xi is chosen at random from [1, 4] and the corresponding yi is subsequently sampled from yi ∼ N(w*xi, σ²xi²) with some true underlying parameter value w*; the value of σ² is the same as in our model.

(a) (3 points) What is the maximum-likelihood estimate of w as a function of the training data?
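As a sketch (not part of the original exam; the values w = 2.0 and σ = 0.5 below are hypothetical, chosen only for illustration), the model y ∼ N(wx, σ²x²) can be simulated to sanity-check question 1 and part (a): for fixed x, y/x = w + ε/x with ε/x ∼ N(0, σ²), and setting the derivative of the log-likelihood to zero gives the closed form ŵ = (1/n) Σi yi/xi.

```python
import random
import statistics

random.seed(0)
w_true, sigma, n = 2.0, 0.5, 10000  # hypothetical values, for illustration only

# Inputs restricted to [1, 4]; outputs sampled from y ~ N(w x, sigma^2 x^2).
xs = [random.uniform(1.0, 4.0) for _ in range(n)]
ys = [random.gauss(w_true * x, sigma * x) for x in xs]

# For fixed x, y/x = w + eps/x and eps/x ~ N(0, sigma^2), so y/x ~ N(w, sigma^2).
ratios = [y / x for x, y in zip(xs, ys)]
print(round(statistics.mean(ratios), 1))   # close to w_true = 2.0
print(round(statistics.stdev(ratios), 1))  # close to sigma = 0.5

# Maximizing the likelihood of y_i ~ N(w x_i, sigma^2 x_i^2) over w yields
# w_hat = (1/n) * sum_i (y_i / x_i), i.e. the sample mean of the ratios.
w_hat = statistics.mean(ratios)
print(abs(w_hat - w_true) < 0.05)          # True
```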
(b) (3 points) What is the variance of this estimator due to the noise in the target outputs, as a function of n and σ², for fixed inputs x1, . . . , xn? For later utility (if you omit this answer) you can denote the answer as V(n, σ²). Some potentially useful relations: if z ∼ N(µ, σ²), then az ∼ N(aµ, a²σ²) for a fixed a; if z1 ∼ N(µ1, σ1²) and z2 ∼ N(µ2, σ2²) and they are independent, then Var(z1 + z2) = σ1² + σ2².

3. In sequential active learning we are free to choose the next training input xn+1, here within [1, 4], for which we will then receive the corresponding noisy target yn+1, sampled from the underlying model. Suppose we already have {(x1, y1), (x2, y2), . . . , (xn, yn)} and are trying to figure out which xn+1 to select. The goal is to choose xn+1 so as to help minimize the variance of the predictions f(x; ŵn) = ŵn x, where ŵn is the maximum likelihood estimate of the parameter w based on the first n training examples.

(a) (2 points) What is the variance of f(x; ŵn) due to the noise in the training outputs, as a function of x, n, and σ², given fixed (already chosen) inputs x1, . . . , xn?

(b) (2 points) Which xn+1 would we choose (within [1, 4]) if we were to next select the x with the maximum variance of f(x; ŵn)?

(c) (T/F – 2 points) Since the variance of f(x; ŵn) only depends on x, n, and σ², we could equally well select the next point at random from [1, 4] and obtain the same reduction in the maximum variance.

[Figure 1: Two possible logistic regression solutions for the three labeled points: curves (1) and (2) showing P(y = 1|x, ŵ) over x ∈ [−2, 2], with three points labeled y = 0, y = 1, y = 0.]

Problem 3

Consider a simple one-dimensional logistic regression model

   P(y = 1|x, w) = g(w0 + w1x)

where g(z) = (1 + exp(−z))⁻¹ is the logistic function.
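The logistic model just defined can be sketched directly in code (the parameter values w0 = 0, w1 = 2 below are hypothetical, used only to illustrate the shape of g):

```python
import math

def g(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x, w0, w1):
    """P(y = 1 | x, w) = g(w0 + w1 x) for the one-dimensional model."""
    return g(w0 + w1 * x)

print(g(0.0))                      # 0.5: the decision boundary is where w0 + w1 x = 0
print(round(g(2.0) + g(-2.0), 6))  # 1.0, since g(-z) = 1 - g(z)

# With hypothetical parameters w0 = 0, w1 = 2, the boundary sits at x = 0:
print(p_y1(-1.0, 0.0, 2.0) < 0.5 < p_y1(1.0, 0.0, 2.0))  # True
```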
1. Figure 1 shows two possible conditional distributions P(y = 1|x, w), viewed as a function of x, that we can get by changing the parameters w.

(a) (2 points) Please indicate the number of classification errors for each conditional, given the labeled examples in the same figure.

   Conditional (1) makes ( ) classification errors
   Conditional (2) makes ( ) classification errors

(b) (3 points) One of the conditionals in Figure 1 corresponds to the maximum likelihood setting of the parameters ŵ based on the labeled data in the figure. Which one is the ML solution (1 or 2)?

(c) (2 points) Would adding a regularization penalty |w1|²/2 to the log-likelihood estimation criterion affect your choice of solution (Y/N)?

[Figure 2: The expected log-likelihood of test labels (ranging from −1.5 to 1) as a function of the number of training examples (from 0 to 300).]

2. (4 points) We can estimate the logistic regression parameters more accurately with more training data. Figure 2 shows the expected log-likelihood of test labels for a simple logistic regression model as a function of the number of training examples and labels. Mark in the figure the structural error (SE) and approximation error (AE), where “error” is measured in terms of log-likelihood.

3. (T/F – 2 points) In general, for small training sets, we are likely to reduce the approximation error by adding a regularization penalty |w1|²/2 to the log-likelihood criterion.
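Counting classification errors, as in question 1(a), just means thresholding P(y = 1|x, ŵ) at 1/2 and comparing against the labels. As a sketch, the x-locations below are hypothetical stand-ins for the three labeled points; only the label pattern 0, 1, 0 is taken from Figure 1:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def n_errors(points, w0, w1):
    """A point (x, y) counts as an error when thresholding g(w0 + w1 x)
    at 1/2 disagrees with its label y."""
    return sum(int(g(w0 + w1 * x) >= 0.5) != y for x, y in points)

# Hypothetical x-locations standing in for the three labeled points.
points = [(-1.5, 0), (0.0, 1), (1.5, 0)]

# g(w0 + w1 x) is monotone in x, so its thresholded predictions are too;
# no monotone prediction matches the pattern 0, 1, 0, hence at least one
# error for every parameter setting (here checked on a small grid):
best = min(n_errors(points, w0, w1)
           for w0 in (-2, 0, 2) for w1 in (-3, -1, 1, 3))
print(best)  # 1
```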
[Figure 3: Equally likely input configurations in the training set: the four binary points (0,0), (1,0), (0,1), (1,1) in the (x1, x2) plane, each labeled x or o.]

Problem 4

Here we will look at methods for selecting input features for a logistic regression model

   P(y = 1|x, w) = g(w0 + w1x1 + w2x2)

The available training examples are very simple, involving only binary