MIT 6 867 - Study Guide - D2139351

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 867> Study Guide

MIT 6 867 - Study Guide

School name Massachusetts Institute of Technology

Course 6 867- Machine Learning

Pages 10

Download Save

Unformatted text preview:

6 867 Machine learning Mid term exam October 8 2003 2 points Your name and MIT ID Problem 1 In this problem we use sequential active learning to estimate a linear model y w1 x w0 where the input space x values are restricted to be within 1 1 The noise term is assumed to be a zero mean Gaussian with an unknown variance 2 Recall that our sequential active learning method selects input points with the highest variance in the predicted outputs Figure 1 below illustrates what outputs would be returned for each query the outputs are not available unless speci cally queried We start the learning algorithm by querying outputs at two input points x 1 and x 1 and let the sequential active learning algorithm select the remaining query points 1 4 points Give the next two inputs that the sequential active learning method would pick Explain why 1 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 1 5 y 1 0 5 0 0 5 1 0 5 0 x 0 5 1 Figure 1 Samples from the underlying relation between the inputs x and outputs y The outputs are not available to the learning algorithm unless speci cally queried 2 4 points In the gure 1 above draw approximately the linear relation between the inputs and outputs that the active learning method would nd after a large number of iterations 3 6 points Would the result be any di erent if we started with query points x 0 and x 1 and let the sequential active learning algorithm select the remaining query points Explain why or why not 2 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY Problem 2 0 log probability average log probability of training labels 0 2 average log probability of test labels 0 4 0 0 5 1 1 5 2 2 5 regularization parameter C 3 3 5 4 Figure 2 Log probability of labels as a function of regularization parameter C Here we use a logistic regression model to solve a classi cation problem In Figure 2 we have plotted the mean log probability of labels in the training and test sets after having trained the classi er with quadratic regularization penalty and di erent values of the regularization parameter C 1 T F 2 points In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions 2 T F 2 points A stochastic gradient algorithm for training logistic regression models with a xed learning rate will nd the optimal setting of the weights exactly 3 T F 2 points The average log probability of training labels as in Figure 2 can never increase as we increase C 3 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 4 4 points Explain why in Figure 2 the test log probability of labels decreases for large values of C 5 T F 2 points The log probability of labels in the test set would decrease for large values of C even if we had a large number of training examples 6 T F 2 points Adding a quadratic regularization penalty for the parameters when estimating a logistic regression model ensures that some of the parameters weights associated with the components of the input vectors vanish Problem 3 Consider a training set consisting of the following eight examples Examples labeled 0 Examples labeled 1 3 3 0 2 2 0 3 3 1 1 1 1 3 3 0 1 1 0 2 2 1 1 1 1 The questions below pertain to various feature selection methods that we could use with the logistic regression model 1 2 points What is the mutual information between the third feature and the target label based on the training set 2 2 points Which feature s would a lter feature selection method choose You can assume here that the mutual information criterion is evaluated between a single feature and the label 4 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 3 2 points Which two feature s would a greedy wrapper process choose 4 4 3points Which features would a regularization approach with a 1 norm penalty i 1 wi choose Explain brie y Problem 4 1 6 points Figure 3 shows the rst decision stump that the AdaBoost algorithm nds starting with the uniform weights over the training examples We claim that the weights associated with the training examples after including this decision stump will be 1 8 1 8 1 8 5 8 the weights here are enumerated as in the gure Are these weights correct why or why not Do not provide an explicit calculation of the weights 2 T F 2 points The votes that AdaBoost algorithm assigns to the component classi ers are optimal in the sense that they ensure larger margins in the training set higher majority predictions than any other setting of the votes 3 T F 2 points In the boosting iterations the training error of each new decision stump and the training error of the combined classi er vary roughly in concert 5 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 1 x x 1 1 o 1 x 1 1 1 2 3 4 Figure 3 The rst decision stump that the boosting algorithm nds Problem 5 x 2 x x x x x x x x x x o x x o o o o o o o o o x 1 Figure 4 Training set maximum margin linear separator and the support vectors in bold 1 4 points What is the leave one out cross validation error estimate for maximum margin separation in gure 4 we are asking for a number 2 T F 2 points We would expect the support vectors to remain the same in general as we move from a linear kernel to higher order polynomial kernels 3 T F 2 points Structural risk minimization is guaranteed to nd the model among those considered with the lowest expected loss 6 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 4 6 points What is the VC dimension of a mixture of two Gaussians model in the plane with equal covariance matrices Why Problem 6 Using a set of 100 labeled training examples two classes we train the following models GaussI A Gaussian mixture model one Gaussian per class where the covariance matrices are both set to I identity matrix

View Full Document


School:
Email:
New Password:
Confirm Password:

MIT 6 867 - Study Guide

Sign up for free to view:

Please select your school