6.867 Machine learning
Mid-term exam
October 13, 2004

(2 points) Your name and MIT ID:

Problem 1

[Figure: three plots, labeled A, B, and C, each showing noise on the vertical axis against x on the horizontal axis.]

1. (6 points) Each plot above claims to represent prediction errors as a function of x for a trained regression model based on some dataset. Some of these plots could potentially be prediction errors for linear or quadratic regression models, while others couldn't. The regression models are trained with the least squares estimation criterion. Please indicate compatible models and plots.

                          A      B      C
   linear regression     ___    ___    ___
   quadratic regression  ___    ___    ___

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Problem 2

Here we explore a regression model where the noise variance is a function of the input (variance increases as a function of input). Specifically, $y = wx + \epsilon$, where the noise $\epsilon$ is normally distributed with mean 0 and standard deviation $\sigma x$. The value of $\sigma$ is assumed known and the input $x$ is restricted to the interval $[1, 4]$. We can write the model more compactly as $y \sim N(wx, \sigma^2 x^2)$.

If we let $x$ vary within $[1, 4]$ and sample outputs $y$ from this model with some $w$, the regression plot might look like

[Figure: scatter plot of sampled outputs y (vertical axis, 0 to 10) against x (horizontal axis, 1 to 4).]

1. (2 points) How is the ratio $y/x$ distributed for a fixed (constant) $x$?

2. Suppose we now have $n$ training points and targets $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where each $x_i$ is chosen at random from $[1, 4]$ and the corresponding $y_i$ is subsequently sampled from $y_i \sim N(w x_i, \sigma^2 x_i^2)$ with some true underlying parameter value $w$; the value of $\sigma^2$ is the same as in our model.

(a) (3 points) What is the maximum likelihood estimate of $w$ as a function of the training data?

(b) (3 points) What is the variance of this estimator due to the noise in the target outputs as a function of $n$ and $\sigma^2$ for
fixed inputs $x_1, \ldots, x_n$? For later utility (if you omit this answer) you can denote the answer as $V(n, \sigma^2)$.

Some potentially useful relations: if $z \sim N(\mu, \sigma^2)$, then $az \sim N(a\mu, \sigma^2 a^2)$ for a fixed $a$. If $z_1 \sim N(\mu_1, \sigma_1^2)$ and $z_2 \sim N(\mu_2, \sigma_2^2)$ and they are independent, then $\mathrm{Var}(z_1 + z_2) = \sigma_1^2 + \sigma_2^2$.

3. In sequential active learning we are free to choose the next training input $x_{n+1}$, here within $[1, 4]$, for which we will then receive the corresponding noisy target $y_{n+1}$, sampled from the underlying model. Suppose we already have $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and are trying to figure out which $x_{n+1}$ to select. The goal is to choose $x_{n+1}$ so as to help minimize the variance of the predictions $f(x; \hat{w}_n) = \hat{w}_n x$, where $\hat{w}_n$ is the maximum likelihood estimate of the parameter $w$ based on the first $n$ training examples.

(a) (2 points) What is the variance of $f(x; \hat{w}_n)$ due to the noise in the training outputs as a function of $x$, $n$, and $\sigma^2$, given fixed (already chosen) inputs $x_1, \ldots, x_n$?

(b) (2 points) Which $x_{n+1}$ would we choose (within $[1, 4]$) if we were to next select $x$ with the maximum variance of $f(x; \hat{w}_n)$?

(c) (T/F, 2 points) Since the variance of $f(x; \hat{w}_n)$ only depends on $x$, $n$, and $\sigma^2$, we could equally well select the next point at random from $[1, 4]$ and obtain the same reduction in the maximum variance.

[Figure 1: Two possible logistic regression solutions for the three labeled points. Two curves, labeled (1) and (2), show $P(y = 1 | x, \hat{w})$ as a function of x; the three points are labeled y = 0, y = 1, and y = 0.]

Problem 3

Consider a simple one dimensional logistic regression model

$P(y = 1 | x, w) = g(w_0 + w_1 x)$

where $g(z) = (1 + \exp(-z))^{-1}$ is the logistic function.

1. Figure 1 shows two possible conditional distributions $P(y = 1 | x, w)$, viewed as a function of $x$, that we can get by changing the parameters $w$.

(a) (2 points) Please indicate the number of classification errors for each conditional given the labeled examples in the same figure.

Conditional (1) makes ___ classification errors
Conditional (2)
makes ___ classification errors

(b) (3 points) One of the conditionals in Figure 1 corresponds to the maximum likelihood setting of the parameters $\hat{w}$ based on the labeled data in the figure. Which one is the ML solution (1 or 2)?

(c) (2 points) Would adding a regularization penalty $|w_1|^2 / 2$ to the log-likelihood estimation criterion affect your choice of solution (Y/N)?

[Figure 2: The expected log-likelihood of test labels as a function of the number of training examples. Vertical axis: expected log-likelihood of test labels; horizontal axis: number of training examples (0 to 300).]

2. (4 points) We can estimate the logistic regression parameters more accurately with more training data. Figure 2 shows the expected log-likelihood of test labels for a simple logistic regression model as a function of the number of training examples and labels. Mark in the figure the structural error (SE) and approximation error (AE), where "error" is measured in terms of log-likelihood.

3. (T/F, 2 points) In general for small training sets, we are likely to reduce the approximation error by adding a regularization penalty $|w_1|^2 / 2$ to the log-likelihood criterion.

[Figure 3: Equally likely input configurations in the training set. The four binary input configurations are plotted with x1 on the horizontal axis and x2 on the vertical axis, marked "x" or "o".]

Problem 4

Here we will look at methods for selecting input features for a logistic regression model

$P(y = 1 | x, w) = g(w_0 + w_1 x_1 + w_2 x_2)$

The available training examples are very simple, involving only binary valued inputs:

   Number of copies   x1   x2   y
   10                 1    1    1
   10                 0    1    0
   10                 1    0    0
   10                 0    0    1

So, for example, …