10-701 Machine Learning, Spring 2011: Homework 3
Due: Feb 17th, at the beginning of class

Instructions: There are 2 questions on this assignment. The last question involves coding. Please submit your writeup as 2 separate sets of pages, one per question (each is graded by a different TA), with your name and userid on each set.

1 Bayes Net [Carl, 40 points]

Figure 1: Graphical model for question 1.1

1. In this problem, you will study the behavior of the simple graphical model shown in Figure 1, often called a "collider". Here, the probability of Z depends on two variables X and Y.

(a) Write the factored distribution represented by this graph. That is, write the distribution as a product of factors, each of which could be replaced by entries in the conditional probability tables (CPTs) in order to obtain the probability of an event. [4 points]

(b) Assume that X, Y, and Z are binary. Write the CPTs for a distribution such that

    P(X = 1) < P(X = 1 | Y = 1, Z = 1) < P(X = 1 | Z = 1).

Make sure to include in your solution the values of P(X = 1 | Y = 1, Z = 1) and P(X = 1 | Z = 1). [8 points]

Figure 2: Graphical model for question 1.2

2. Let P(A, B, C, D, E, F, G, H) factorize according to the graph given in Figure 2. Prove or disprove whether the following independence properties hold. [3 points each]

(a) (A ⊥ H)
(b) (B ⊥ G)
(c) (B ⊥ G | E)
(d) (C ⊥ D | G, A)
(e) (C ⊥ D)
(f) (B ⊥ D | C)

Figure 3: Graphical model for question 1.3

3. Let X = {X_1, ..., X_n} have a distribution P(X) that factorizes according to the graph given in Figure 3, and assume each X_i is binary. Design a polynomial-time algorithm which will compute the following quantities. Use pseudocode for your answer. In your pseudocode, assume that the variables px1 and px already exist, where px1 is the prior probability P(X_1 = 1), and px is an (n-1)-by-2 array, where px[i][j] is P(X_{i+1} = 1 | X_i = j - 1) for j ∈ {1, 2}.
For this problem, you may assume that the numbers have arbitrary precision, so you do not need to worry about underflow.

(a) Compute P(X_n = 1 | X_1 = x1). Assume that x1 has already been initialized with the value of X_1 ∈ {0, 1}. [5 points]

(b) Compute P(X_1 = 1 | X_n = xn). Assume that xn has already been initialized with the value of X_n ∈ {0, 1}. [5 points]

2 Multi-class Logistic Regression with Applications to Handwritten Digit Recognition [Xi, 60 points]

Recall the multi-class logistic regression model on page 6 of the lecture on Jan 27th. Assume that we have K different classes and each input x is a d-dimensional vector. The posterior probability is given by:

    P(Y = k | X = x) = exp(w_k^T x) / (1 + Σ_{l=1}^{K-1} exp(w_l^T x))   for k = 1, ..., K-1,

    P(Y = K | X = x) = 1 / (1 + Σ_{l=1}^{K-1} exp(w_l^T x)),

where w_k^T denotes the transpose of the vector w_k, and w_k^T x is the inner product of w_k and x (i.e., w_k^T x = Σ_i w_{ki} x_i). To simplify, we ignore the constant term w_{k0} in the logistic regression model.

For ease of notation, we can replace the two expressions above for P(Y | X) by a single expression. We do this by introducing a fixed pseudo-parameter vector w_K = 0. Now, for any class label k, we simply write

    P(Y = k | X = x) = exp(w_k^T x) / (1 + Σ_{l=1}^{K-1} exp(w_l^T x))   for k = 1, ..., K.

1. How many parameters do we need to estimate? What are these parameters? [5pt]

2. Given n training samples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, please write down explicitly the log-likelihood function and simplify it as much as you can:

    L(w_1, ..., w_{K-1}) = Σ_{i=1}^{n} ln P(Y = y_i | X = x_i).

[5pt]

3. Compute the gradient of L with respect to each w_k and simplify it. [5pt]
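As a sanity check for parts 1–3, the model and its gradient can be sketched in NumPy. This is not part of the assignment; the function and variable names below are my own, and the gradient implemented is the standard softmax form, dL/dw_k = Σ_i (1{y_i = k} − P(Y = k | x_i)) x_i, which is what part 3 should simplify to:

```python
import numpy as np

def posterior(W, X):
    """P(Y = k | x) under the model above.

    W: (K-1, d) weights (w_K is fixed to 0); X: (n, d) inputs.
    Returns an (n, K) matrix of class probabilities.
    """
    scores = X @ W.T                                       # w_k^T x, (n, K-1)
    scores = np.hstack([scores, np.zeros((X.shape[0], 1))])  # class K scores 0
    scores -= scores.max(axis=1, keepdims=True)            # stabilize exp
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(W, X, y):
    """L(w_1, ..., w_{K-1}) = sum_i ln P(Y = y_i | x_i); labels y in {1,...,K}."""
    P = posterior(W, X)
    return np.sum(np.log(P[np.arange(len(y)), y - 1]))

def gradient(W, X, y):
    """dL/dw_k = sum_i (1{y_i = k} - P(Y = k | x_i)) x_i, stacked for k < K."""
    P = posterior(W, X)                                    # (n, K)
    K = W.shape[0] + 1
    Y = np.eye(K)[y - 1]                                   # one-hot labels, (n, K)
    return (Y - P)[:, :K - 1].T @ X                        # (K-1, d)
```

A finite-difference comparison of `gradient` against `log_likelihood` is a quick way to check your own derivation before moving on to the training experiments.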
4. Now we add the regularization term (λ/2) Σ_{l=1}^{K-1} ||w_l||_2^2 and define a new objective function:

    f(w_1, ..., w_{K-1}) = L(w_1, ..., w_{K-1}) − (λ/2) Σ_{l=1}^{K-1} ||w_l||_2^2.

Compute the gradient of f with respect to each w_k. [5pt]

5. In this part, you are going to play with the "handwritten digit recognition" problem on the USPS digit dataset. Each handwritten digit image is 16 by 16 pixels. If we treat the value of each pixel as a boolean feature (either 0 for black or 1 for white), then each example has 16 × 16 = 256 {0, 1}-valued features, and hence x has 256 dimensions. Each digit (i.e., 1, 2, 3, 4, 5, 6, 7, 8, 9, 0) corresponds to a class label y (y = 1, ..., K, with K = 10). For each digit, we have 600 training samples and 500 testing samples. You can view these images at http://www.cs.nyu.edu/~roweis/data/usps_0.jpg, ..., http://www.cs.nyu.edu/~roweis/data/usps_9.jpg.

Please download the data from the website. Load the usps_digital.mat file in usps_digital.zip into MATLAB. You will have four matrices:

• tr_X: training input matrix with dimension 6000 × 256.
• tr_y: training labels of length 6000; each element is from 1 to 10.
• te_X: testing input matrix with dimension 5000 × 256.
• te_y: testing labels of length 5000; each element is from 1 to 10.

For those who do NOT want to use MATLAB, we also provide text files for these four matrices in usps_digital.zip.

Note that if you want to view the image of a particular training/testing example in MATLAB, say the 1000th training example, you may use the MATLAB command imshow(reshape(tr_X(1000,:),16,16)).

(a) [15pt] Use the gradient ascent algorithm to train a multi-class logistic regression classifier. Plot (1) the objective value (log-likelihood), (2) the training accuracy, and (3) the testing accuracy versus the number of iterations. Report your final testing accuracy, i.e., the fraction of test images that are correctly classified.

Note that you must choose a suitable learning rate (i.e., stepsize) for the gradient ascent algorithm. A hint is that your learning rate cannot be too large; otherwise your objective will increase only for the first few iterations.

In addition, you need to choose a suitable stopping criterion. You might use the number of iterations, the change in the objective value, or the maximum of the L2 norms of the gradients with respect to each w_k.
Or you might watch the increase of the testing accuracy and stop the optimization when the accuracy is stable.

(b) [20pt] Now we add the regularization term (λ/2) Σ_{l=1}^{K-1} ||w_l||_2^2. For λ = 1, 10, 100, 1000, report the final testing accuracies.

(c) [5pt] What can you conclude from the above experiment? (Hint: consider the relationship between the regularization weight and the prediction accuracy.)
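The regularized gradient-ascent loop in part 5 can be outlined as follows. This is a hedged sketch, not the required implementation: the learning rate, iteration count, and data below are placeholders (the assignment's tr_X/tr_y would replace them, and the stepsize must be tuned as the hints above describe):

```python
import numpy as np

def train(X, y, K, lam=0.0, lr=1e-3, iters=500):
    """Gradient ascent on f(w) = L(w) - (lam/2) * sum_l ||w_l||_2^2.

    X: (n, d) feature matrix; y: labels in {1, ..., K}.
    Returns W of shape (K-1, d); w_K is implicitly fixed to 0.
    """
    n, d = X.shape
    W = np.zeros((K - 1, d))
    Y = np.eye(K)[y - 1]                               # one-hot labels, (n, K)
    for _ in range(iters):
        S = np.hstack([X @ W.T, np.zeros((n, 1))])     # scores; class K gets 0
        S -= S.max(axis=1, keepdims=True)              # numerical stability
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)              # posteriors P(Y=k | x_i)
        grad = (Y - P)[:, :K - 1].T @ X - lam * W      # gradient of f w.r.t. each w_k
        W += lr * grad                                 # ascent step
    return W

def accuracy(W, X, y):
    """Fraction of examples whose arg-max class matches the label."""
    S = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])
    return np.mean(S.argmax(axis=1) + 1 == y)
```

With the real data this would be called as, e.g., W = train(tr_X, tr_y, 10, lam=1.0), followed by accuracy(W, te_X, te_y); for the plots in part (a) you would record the objective and both accuracies inside the loop rather than only at the end.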