6.867 Machine Learning
Mid-Term Exam, October 13, 2006

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.

(2 points) Your name and MIT ID:

Problem 1

Suppose we are trying to solve an active learning problem, where the possible inputs you can select form a discrete set. Specifically, we have a set of N unlabeled documents, Φ_1, ..., Φ_N, where each document is represented as a binary feature vector Φ = [φ_1, ..., φ_m]^T with φ_i = 1 if word i appears in the document and zero otherwise. Our goal is to quickly label these N documents with 0/1 labels. We can request a label for any of the N documents, preferably for as few as possible. We also have a small set of n already labeled documents to get us started. We use a logistic regression model to solve the classification task:

P(y = 1 | Φ, w) = g(w^T Φ)

where g(·) is the logistic function. Note that we do not include a bias term.

1. (T/F – 2 points) Any word that appears in all N documents would effectively provide a bias term for the logistic regression model.

2. (T/F – 2 points) Any word that appears only in the n available labeled documents used for initially training the logistic regression model would serve equally well as a bias term.

3. Having trained the logistic regression model on the basis of the n labeled documents, obtaining ŵ_n, we'd like to request additional labeled documents. For this, we will use the following measure of uncertainty in our predictions:

E_{y ∼ p_t} |y − p_t| = p_t |1 − p_t| + (1 − p_t) |0 − p_t| = 2 p_t (1 − p_t)

where p_t = P(y = 1 | Φ_t, ŵ_n) is our current prediction of the probability that y = 1 for the t-th unlabeled document Φ_t = [φ_t1, ..., φ_tm]^T.

a) (4 points) We would request the label for the document/query point Φ_t that has

( ) the smallest value of 2 p_t (1 − p_t)
( ) the largest value of 2 p_t (1 − p_t)
( ) an intermediate value of 2 p_t (1 − p_t)

Briefly explain the rationale behind the selection criterion that you chose.

b) (2 points) Sketch ŵ_n in Figure 1.1. Write down the equation, expressed solely in terms of Φ and ŵ_n, that Φ has to satisfy for it to lie exactly on the decision boundary:

c) (4 points) In Figure 1.2, circle the next point we would select according to the criterion. Draw the two decision boundaries that would result from incorporating the new point in the training set, labeling the boundaries as y = 1 and y = 0, depending on the outcome of the query.

[Figure 1.1: Two labeled points, unlabeled points, and the decision boundary. The point "+" corresponds to y = 1.]

[Figure 1.2: Two labeled points, unlabeled points, and the decision boundary. The point "+" corresponds to y = 1.]

4. (T/F – 2 points) The criterion we have used here for active learning guarantees that the measure of uncertainty about the labels of the unlabeled points will decrease monotonically for each point after each query.
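As an aside, the uncertainty measure defined in question 3 is cheap to compute for all N unlabeled documents at once. The following is a minimal NumPy sketch of that computation; the random data, array shapes, and function names are illustrative assumptions, not part of the exam.

```python
import numpy as np

def logistic(z):
    """The logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def uncertainty_scores(Phi_unlabeled, w_hat):
    """Compute 2 * p_t * (1 - p_t) for each unlabeled document.

    Phi_unlabeled: (N, m) binary feature matrix, one document per row.
    w_hat: (m,) weight vector trained on the n labeled documents.
    """
    p = logistic(Phi_unlabeled @ w_hat)   # p_t = P(y = 1 | Phi_t, w_hat)
    return 2.0 * p * (1.0 - p)            # E|y - p_t|

# Hypothetical data: 100 documents over a 50-word vocabulary.
rng = np.random.default_rng(0)
Phi = rng.integers(0, 2, size=(100, 50)).astype(float)
w_hat = rng.normal(size=50)

scores = uncertainty_scores(Phi, w_hat)
ranking = np.argsort(scores)  # ascending: least to most uncertain
```

Note that 2 p_t (1 − p_t) vanishes as p_t approaches 0 or 1 and peaks at p_t = 0.5; which end of the ranking to query is exactly what question 3(a) asks you to decide.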
Problem 2

Consider a regression problem where the two-dimensional input points x = [x_1, x_2]^T are constrained to lie within the unit square: x_i ∈ [−1, 1], i = 1, 2. The training and test input points x are sampled uniformly at random within the unit square. The target outputs y are governed by the following model:

y ∼ N(x_1^3 x_2^5 − 10 x_1 x_2 + 7 x_1^2 + 5 x_2 − 3, 1)

In other words, the outputs are normally distributed with mean x_1^3 x_2^5 − 10 x_1 x_2 + 7 x_1^2 + 5 x_2 − 3 and variance 1.

We learn to predict y given x using linear regression models with 1st through 10th order polynomial features. The models are nested in the sense that the higher order models include all the lower order features. The estimation criterion is the mean squared error.

We first train 1st, 2nd, 8th, and 10th order models using n = 20 training points, and then test the predictions on a large number of independently sampled points.

1. (6 points) Select all the appropriate model(s) for each column. If you think the highest, or lowest, error would be shared among several models, be sure to list all models.

              Lowest training error   Highest training error   Lowest test error (typically)
  1st order          ( )                      ( )                       ( )
  2nd order          ( )                      ( )                       ( )
  8th order          ( )                      ( )                       ( )
  10th order         ( )                      ( )                       ( )

Briefly explain your selection in the last column, i.e., the model you would expect to have the lowest test error:

2. (6 points) We now train the polynomial regression models using n = 10^6 (one million) training points. Again select the appropriate model(s) for each column. If you think the highest, or lowest, error would be shared among several models, be sure to list all models.

              Lowest structural error   Highest approximation error   Lowest test error
  1st order           ( )                         ( )                      ( )
  2nd order           ( )                         ( )                      ( )
  8th order           ( )                         ( )                      ( )
  10th order          ( )                         ( )                      ( )

3. (T/F – 2 points) The approximation error of a polynomial regression model depends on the number of training points.

4. (T/F – 2 points) The structural error of a polynomial regression model depends on the number of training points.
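The experiment described in this problem is easy to reproduce empirically. The sketch below is a hypothetical version using scikit-learn (the library choice, seed, and test-set size are assumptions, not part of the exam): it draws data from the stated model and compares training and test MSE across the four orders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def true_mean(X):
    """Mean of y under the stated generating model."""
    x1, x2 = X[:, 0], X[:, 1]
    return x1**3 * x2**5 - 10 * x1 * x2 + 7 * x1**2 + 5 * x2 - 3

def sample(n, rng):
    """Draw n inputs uniformly from the unit square, y ~ N(mean, 1)."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = true_mean(X) + rng.normal(size=n)  # unit-variance noise
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = sample(20, rng)       # n = 20 training points
X_test, y_test = sample(100_000, rng)    # large independent test set

for order in (1, 2, 8, 10):
    feats = PolynomialFeatures(degree=order)  # nested feature sets
    model = LinearRegression().fit(feats.fit_transform(X_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(feats.transform(X_train)))
    test_mse = mean_squared_error(y_test, model.predict(feats.transform(X_test)))
    print(f"order {order:2d}: train MSE {train_mse:8.3f}, test MSE {test_mse:10.3f}")
```

Rerunning the same loop with a much larger training set is the experiment that question 2 contrasts with question 1.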


Problem 3

We consider here linear and non-linear support vector machines (SVMs) of the form:

min w_1^2 / 2   subject to   y_i (w_1 x_i + w_0) − 1 ≥ 0,   i = 1, ..., n

or

min w^T w / 2   subject to   y_i (w^T Φ_i + w_0) − 1 ≥ 0,   i = 1, ..., n

where Φ_i is a feature vector constructed from the corresponding real-valued input x_i. We wish to compare the simple linear SVM classifier (w_1 x + w_0) and the non-linear classifier (w^T Φ + w_0), where Φ = [x, x^2]^T.

1. (3 points) Provide three input points x_1, x_2, and x_3 and their associated ±1 labels such that they cannot be separated with the simple linear classifier, but are separable by the non-linear classifier.
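A candidate answer to this question can be sanity-checked numerically by fitting both classifiers with a standard SVM solver. The sketch below uses scikit-learn's SVC with a large C to approximate the hard-margin problem; the three points and labels are one illustrative configuration chosen for the demonstration, not necessarily the intended answer.

```python
import numpy as np
from sklearn.svm import SVC

# One configuration to try (an assumption for illustration): the middle
# point carries the opposite label, so no single threshold on x separates
# the classes.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([1, -1, 1])

# Linear classifier on the raw input: w_1 * x + w_0.
linear = SVC(kernel="linear", C=1e6).fit(x.reshape(-1, 1), y)
print("linear accuracy:", linear.score(x.reshape(-1, 1), y))  # below 1.0

# Non-linear classifier via the explicit feature map Phi = [x, x^2]^T.
Phi = np.column_stack([x, x**2])
nonlinear = SVC(kernel="linear", C=1e6).fit(Phi, y)
print("quadratic accuracy:", nonlinear.score(Phi, y))         # 1.0
```

The second fit succeeds because the x^2 coordinate places the two outer points above the middle one, so a line in the (x, x^2) plane separates the labels.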
