MIT 6.867 - Study Guide

6.867 Machine Learning Mid-term Exam, October 8, 2003

(2 points) Your name and MIT ID:

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Problem 1

In this problem we use sequential active learning to estimate a linear model \( y = w_1 x + w_0 + \epsilon \), where the input space (x values) is restricted to [−1, 1]. The noise term \( \epsilon \) is assumed to be zero-mean Gaussian with an unknown variance \( \sigma^2 \). Recall that our sequential active learning method selects input points with the highest variance in the predicted outputs. Figure 1 below illustrates what outputs would be returned for each query (the outputs are not available unless specifically queried). We start the learning algorithm by querying outputs at two input points, x = −1 and x = 1, and let the sequential active learning algorithm select the remaining query points.

1. (4 points) Give the next two inputs that the sequential active learning method would pick. Explain why.

[Figure 1: Samples from the underlying relation between the inputs x and outputs y. The outputs are not available to the learning algorithm unless specifically queried. (Axes: x from −1 to 1, y from −0.5 to 1.5.)]

2. (4 points) In Figure 1 above, draw (approximately) the linear relation between the inputs and outputs that the active learning method would find after a large number of iterations.

3. (6 points) Would the result be any different if we started with query points x = 0 and x = 1 and let the sequential active learning algorithm select the remaining query points? Explain why or why not.

Problem 2

[Figure 2: Log-probability of labels as a function of the regularization parameter C. Two curves: the average log-probability of the test labels and of the training labels. (Axes: C from 0 to 4, log-probability from −0.4 to 0.)]

Here we use a logistic regression model to solve a classification problem. In Figure 2, we have plotted the mean log-probability of the labels in the training and test sets after having trained the classifier with a quadratic regularization penalty and different values of the regularization parameter C.

1. (T/F – 2 points) In training a logistic regression model by maximizing the likelihood of the labels given the inputs, we have multiple locally optimal solutions.

2. (T/F – 2 points) A stochastic gradient algorithm for training logistic regression models with a fixed learning rate will find the optimal setting of the weights exactly.

3. (T/F – 2 points) The average log-probability of the training labels, as in Figure 2, can never increase as we increase C.

4. (4 points) Explain why in Figure 2 the test log-probability of the labels decreases for large values of C.

5. (T/F – 2 points) The log-probability of labels in the test set would decrease for large values of C even if we had a large number of training examples.

6. (T/F – 2 points) Adding a quadratic regularization penalty for the parameters when estimating a logistic regression model ensures that some of the parameters (weights associated with the components of the input vectors) vanish.
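
The following is a minimal sketch of the selection rule Problem 1 describes (the function and variable names are my own; this is not course code). For the linear model, the variance of the predicted output at a candidate input x is proportional to the quadratic form \( \phi(x)^T (\Phi^T \Phi)^{-1} \phi(x) \), where \( \phi(x) = [x, 1] \) and \( \Phi \) stacks the feature vectors of the points queried so far, so the learner queries the candidate that maximizes it:

```python
import numpy as np

def predicted_output_variance(queried_x, candidate_x):
    """Relative variance of the predicted output at each candidate input."""
    Phi = np.column_stack([queried_x, np.ones_like(queried_x)])
    A_inv = np.linalg.inv(Phi.T @ Phi)
    phi = np.column_stack([candidate_x, np.ones_like(candidate_x)])
    # Per-candidate quadratic form phi(x)^T A_inv phi(x).
    return np.einsum('ij,jk,ik->i', phi, A_inv, phi)

queried = [-1.0, 1.0]                    # the two initial queries
candidates = np.linspace(-1.0, 1.0, 201)

for step in range(2):                    # pick the next two query points
    var = predicted_output_variance(np.array(queried), candidates)
    # np.argmax breaks the initial tie between the two endpoints
    # by picking the first one (x = -1).
    x_next = candidates[np.argmax(var)]
    print(f"query {len(queried) + 1}: x = {x_next:+.2f}")
    queried.append(x_next)
```

Run on [−1, 1] starting from queries at x = −1 and x = 1, the variance is maximized at an endpoint on every iteration, which is the behavior question 1 asks about.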
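
As a companion to Problem 2, here is a small self-contained sketch of the quantity plotted in Figure 2: the average log-probability of labels under a logistic regression model fit with a quadratic penalty \( C \|w\|^2 \). The data are synthetic stand-ins (the exam's dataset is not available), and the plain gradient-descent fit is my own choice, not the course's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, w_true):
    """Synthetic stand-in data: logistic labels from a known weight vector."""
    X = rng.normal(size=(n, w_true.size))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
    return X, y

w_true = rng.normal(size=5)
X_tr, y_tr = make_data(100, w_true)
X_te, y_te = make_data(100, w_true)

def fit_logreg(X, y, C, steps=5000, lr=0.1):
    """Gradient descent on mean NLL + C * ||w||^2. Note: here larger C
    means *stronger* regularization, matching the exam's convention."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + 2.0 * C * w)
    return w

def avg_log_prob(X, y, w):
    """Mean log-probability of the observed labels under the model."""
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

for C in [0.0, 0.01, 0.1, 1.0]:
    w = fit_logreg(X_tr, y_tr, C)
    print(f"C={C:<5} train={avg_log_prob(X_tr, y_tr, w):+.3f} "
          f"test={avg_log_prob(X_te, y_te, w):+.3f}")
```

The training curve can only go down as C grows (the penalized optimum trades likelihood for a smaller norm), mirroring question 3.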
Problem 3

Consider a training set consisting of the following eight examples (each row lists one example from each class as (x1, x2, x3)):

Examples labeled "0"    Examples labeled "1"
(3, 3, 0)               (2, 2, 0)
(3, 3, 1)               (1, 1, 1)
(3, 3, 0)               (1, 1, 0)
(2, 2, 1)               (1, 1, 1)

The questions below pertain to various feature selection methods that we could use with the logistic regression model.

1. (2 points) What is the mutual information between the third feature and the target label, based on the training set?

2. (2 points) Which feature(s) would a filter feature selection method choose? You can assume here that the mutual information criterion is evaluated between a single feature and the label.

3. (2 points) Which two features would a greedy wrapper method choose?

4. (4 points) Which features would a regularization approach with the 1-norm penalty \( \sum_{i=1}^{3} |w_i| \) choose? Explain briefly.

Problem 4

[Figure 3: The first decision stump that the boosting algorithm finds; four training examples (numbered 1-4, marked x and o) and the stump's +1/−1 decision regions.]

1. (6 points) Figure 3 shows the first decision stump that the AdaBoost algorithm finds (starting with uniform weights over the training examples). We claim that the weights associated with the training examples after including this decision stump will be [1/8, 1/8, 1/8, 5/8] (the weights here are enumerated as in the figure). Are these weights correct? Why or why not? Do not provide an explicit calculation of the weights.

2. (T/F – 2 points) The votes that the AdaBoost algorithm assigns to the component classifiers are optimal in the sense that they ensure larger "margins" in the training set (higher majority predictions) than any other setting of the votes.

3. (T/F – 2 points) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert.

Problem 5

[Figure 4: Training set (points marked x and o in the (x1, x2) plane), the maximum margin linear separator, and the support vectors (in bold).]

1. (4 points) What is the leave-one-out cross-validation error estimate for maximum margin separation in Figure 4? (We are asking for a number.)

2. (T/F – 2 points) We would expect the support vectors to remain the same, in general, as we move from a linear kernel to higher-order polynomial kernels.

3. (T/F – 2 points) Structural risk minimization is guaranteed to find the model (among
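
For Problem 3, question 1, here is a quick empirical check, assuming the table reconstruction above reads the flattened preview correctly. The sketch computes the mutual information between each feature and the label directly from counts; with these eight examples the third feature takes the values 0 and 1 equally often under both labels, so \( I(x_3; y) = 0 \):

```python
import numpy as np

# The eight examples as reconstructed in the Problem 3 table above
# (an assumption about how the flattened preview should be read).
label0 = [(3, 3, 0), (3, 3, 1), (3, 3, 0), (2, 2, 1)]
label1 = [(2, 2, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]

def mutual_information(feature_vals, labels):
    """Empirical mutual information I(feature; label) in bits."""
    n = len(labels)
    mi = 0.0
    for f in set(feature_vals):
        for y in set(labels):
            p_fy = sum(1 for a, b in zip(feature_vals, labels)
                       if a == f and b == y) / n
            if p_fy == 0:
                continue
            p_f = feature_vals.count(f) / n
            p_y = labels.count(y) / n
            mi += p_fy * np.log2(p_fy / (p_f * p_y))
    return mi

examples = label0 + label1
labels = [0] * 4 + [1] * 4
for i in range(3):
    vals = [ex[i] for ex in examples]
    print(f"I(x{i+1}; y) = {mutual_information(vals, labels):.3f} bits")
```

On this data the sketch reports 0.750 bits for the first two features and 0.000 bits for the third, which is the contrast the filter-selection question turns on.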
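
For Problem 4, question 1, here is a numerical sanity check of the claimed weights, assuming (as the claimed 5/8 on example 4 suggests) that the stump misclassifies exactly that one of the four uniformly weighted examples. This is the standard AdaBoost update, not course code:

```python
import numpy as np

w = np.full(4, 0.25)                     # uniform initial weights
correct = np.array([True, True, True, False])  # assumed: only ex. 4 is wrong

eps = w[~correct].sum()                  # weighted training error = 1/4
alpha = 0.5 * np.log((1 - eps) / eps)    # the stump's vote

# Scale up the misclassified example, scale down the rest, renormalize.
w = w * np.exp(np.where(correct, -alpha, alpha))
w /= w.sum()

print("updated weights:", w)             # -> [1/6, 1/6, 1/6, 1/2]
print("stump's error on new weights:", w[~correct].sum())  # -> 0.5
```

Under this assumption the exact update yields [1/6, 1/6, 1/6, 1/2]: after an AdaBoost round the stump's weighted error on the new weights is always exactly 1/2, whereas the claimed [1/8, 1/8, 1/8, 5/8] would give 5/8.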
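
Problem 5's training set exists only in Figure 4, so it cannot be reproduced here, but the following sketch shows how a leave-one-out error estimate for a (near) hard-margin linear SVM is computed in practice, using toy stand-in data and scikit-learn. The fact relevant to question 1 is in the last two lines: removing a non-support vector cannot change the separator, so the LOO error is at most (#support vectors) / n:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)

# Toy stand-in for Figure 4: two roughly separated blobs in the (x1, x2) plane.
X = np.vstack([rng.normal([-2, 0], 0.8, size=(15, 2)),
               rng.normal([+2, 0], 0.8, size=(15, 2))])
y = np.array([0] * 15 + [1] * 15)

# A very large C approximates the hard-margin (maximum margin) separator.
errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", C=1e6).fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

print(f"LOOCV error estimate: {errors}/{len(y)} = {errors / len(y):.3f}")

# Removing a non-support vector leaves the separator unchanged, so the
# LOO error is bounded above by (#support vectors) / n.
n_sv = SVC(kernel="linear", C=1e6).fit(X, y).support_vectors_.shape[0]
print(f"#support vectors / n = {n_sv}/{len(y)}")
```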