10-701/15-781 Fall 2006 Midterm

There are 7 questions in this exam (11 pages including this cover sheet). Questions are not equally difficult. If you need more room to work out your answer to a question, use the back of the page, and clearly mark on the front of the page if we are to look at what's on the back. This exam is open book and open notes. Computers, PDAs, and cell phones are not allowed. You have 1 hour and 20 minutes. Good luck!

Name:
Andrew ID:

Q   Topic                                            Max Score   Score
1   Conditional Independence, MLE/MAP, Probability   12
2   Decision Tree                                    12
3   Neural Network and Regression                    18
4   Bias-Variance Decomposition                      12
5   Support Vector Machine                           12
6   Generative vs. Discriminative Classifier         20
7   Learning Theory                                  14
    Total                                            100

1 Conditional Independence, MLE/MAP, Probability (12 pts)

1. (4 pts) Show that Pr(X, Y | Z) = Pr(X | Z) Pr(Y | Z) if Pr(X | Y, Z) = Pr(X | Z).

2. (4 pts) If a data point y follows the Poisson distribution with rate parameter λ, then the probability of a single observation y is

   p(y | λ) = λ^y e^{−λ} / y!,   for y = 0, 1, 2, ...

   You are given data points y1, ..., yn independently drawn from a Poisson distribution with parameter λ. Write down the log-likelihood of the data as a function of λ.

3. (4 pts) Suppose that in answering a question on a multiple-choice test, an examinee either knows the answer, with probability p, or guesses, with probability 1 − p. Assume that the probability of answering a question correctly is 1 for an examinee who knows the answer and 1/m for an examinee who guesses, where m is the number of multiple-choice alternatives. What is the probability that an examinee knew the answer to a question, given that he has answered it correctly?

2 Decision Tree (12 pts)

The following data set will be used to learn a decision tree for predicting whether students are lazy (L) or diligent (D) based on their weight (Normal or Underweight), their eye color (Amber or Violet), and the number of eyes they have (2, 3, or 4).

Weight      N  N  N  U  U  U  N  N  U  U
Eye Color   A  V  V  V  V  A  A  V  A  A
Num Eyes    2  2  2  3  3  4  4  4  3  3
Output      L  L  L  L  L  D  D  D  D  D

The following numbers may be helpful as you answer this problem without using a calculator:
log2(0.1) ≈ −3.32, log2(0.2) ≈ −2.32, log2(0.3) ≈ −1.73, log2(0.4) ≈ −1.32, log2(0.5) = −1.

You don't need to show the derivation for your answers in this problem.

1. (3 pts) What is the conditional entropy H(EyeColor | Weight = N)?

2. (3 pts) What attribute would the ID3 algorithm choose to use for the root of the tree (no pruning)?

3. (4 pts) Draw the full decision tree learned for this data (no pruning).

4. (2 pts) What is the training-set error of this unpruned tree?
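(The quantities asked about in Question 2 can be checked mechanically. The Python sketch below is not part of the original exam; the data encoding and helper functions are my own. It computes H(EyeColor | Weight = N) and the conditional entropy of the output given each attribute, which is what ID3 compares when choosing the root split. The printed values are the kind of numbers the log2 hints above are meant to support.)

    # Illustrative sketch only: entropy calculations for the Question 2 data set.
    from collections import Counter
    from math import log2

    # The ten training examples from the table above, one list per attribute.
    weight    = list("NNNUUUNNUU")
    eye_color = list("AVVVVAAVAA")
    num_eyes  = [2, 2, 2, 3, 3, 4, 4, 4, 3, 3]
    output    = list("LLLLLDDDDD")

    def entropy(labels):
        """Entropy (in bits) of the empirical distribution of a list of labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def conditional_entropy(attr, labels):
        """H(labels | attr): entropy of labels averaged over the values of attr."""
        n = len(labels)
        h = 0.0
        for v in set(attr):
            subset = [y for a, y in zip(attr, labels) if a == v]
            h += (len(subset) / n) * entropy(subset)
        return h

    # Question 2.1: entropy of EyeColor among the examples with Weight = N.
    eyes_given_N = [e for w, e in zip(weight, eye_color) if w == "N"]
    print("H(EyeColor | Weight = N) ≈", round(entropy(eyes_given_N), 2))

    # Question 2.2: ID3 splits on the attribute with the lowest H(Output | attribute),
    # i.e. the highest information gain.
    for name, attr in [("Weight", weight), ("EyeColor", eye_color), ("NumEyes", num_eyes)]:
        print(name, ": H(Output | attribute) ≈", round(conditional_entropy(attr, output), 2))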
3 Neural Network and Regression (18 pts)

Consider a two-layer neural network to learn a function f: X → Y, where X = <X1, X2> consists of two attributes. The weights w1, ..., w6 can be arbitrary. There are two possible choices for the function implemented by each unit in this network:

S: signed sigmoid function, S(a) = sign(σ(a) − 0.5) = sign(1/(1 + exp(−a)) − 0.5)
L: linear function, L(a) = c·a

where in both cases a = Σ_i wi Xi.

1. (4 pts) Assign proper activation functions (S or L) to each unit in the following graph so this neural network simulates a linear regression Y = β1 X1 + β2 X2.

2. (4 pts) Assign proper activation functions (S or L) to each unit in the following graph so this neural network simulates a binary logistic regression classifier Y = arg max_y P(Y = y | X), where

   P(Y = 1 | X) = exp(β1 X1 + β2 X2) / (1 + exp(β1 X1 + β2 X2)),
   P(Y = −1 | X) = 1 / (1 + exp(β1 X1 + β2 X2)).

3. (3 pts) Following problem 3.2, derive β1 and β2 in terms of w1, ..., w6.

4. (4 pts) Assign proper activation functions (S or L) to each unit in the following graph so this neural network simulates a boosting classifier which combines two logistic regression classifiers f1: X → Y1 and f2: X → Y2 to produce its final prediction Y = sign(α1 Y1 + α2 Y2). Use the same definition as in problem 3.2 for f1 and f2.

5. (3 pts) Following problem 3.4, derive α1 and α2 in terms of w1, ..., w6.

4 Bias-Variance Decomposition (12 pts)

1. (6 pts) Suppose you have regression data generated by a polynomial of degree 3. Characterize the bias and variance of the estimates of the following models on the data, with respect to the true model, by circling the appropriate entry:

                                          Bias         Variance
   Linear regression                      low / high   low / high
   Polynomial regression with degree 3    low / high   low / high
   Polynomial regression with degree 10   low / high   low / high

2. Let Y = f(X) + ε, where ε has mean zero and variance σ². In k-nearest-neighbor (kNN) regression, the prediction of Y at a point x0 is given by the average of the Y values at the k neighbors closest to x0.

   (a) (2 pts) Denote the i-th nearest neighbor of x0 by x(i) and its corresponding Y value by y(i). Write the prediction f̂(x0) of the kNN regression for x0 in terms of y(i), i = 1, ..., k.

   (b) (2 pts) What is the behavior of the bias as k increases?

   (c) (2 pts) What is the behavior of the variance as k increases?

5 Support Vector Machine (12 pts)

Consider a supervised learning problem in which the training examples are points in 2-dimensional space. The positive examples are (1, 1) and (−1, −1). The negative examples are (1, −1) and (−1, 1).

1. (1 pt) Are the positive examples linearly separable from the negative examples in the original space?

2. (4 pts) Consider the feature transformation Φ(x) = (1, x1, x2, x1·x2), where x1 and x2 are, respectively, the first and second coordinates of a generic example x. The prediction function is y(x) = w^T Φ(x) in this feature space. Give the coefficients w of a maximum-margin decision surface separating the positive examples from the negative examples. (You should be able to do this by inspection, without any significant computation.)

3. (3 pts) Add one training example to the graph so that the total of five examples can no longer be linearly separated in the feature space Φ(x) defined in problem 5.2.

4. (4 pts) What kernel K(x, x′) does this feature transformation correspond to?

6 Generative vs. Discriminative Classifier (20 pts)

Consider the binary classification problem where the class label Y ∈ {0, 1} and each training example X has 2 binary attributes X1, X2 ∈ {0, 1}. In this problem we will always assume X1 and X2 are conditionally independent given Y, and that the class priors are P(Y …