10-701 Machine Learning, Spring 2011: Homework 3
Due: Feb 17th, at the beginning of class

Instructions: There are 2 questions on this assignment. The last question involves coding. Please submit your writeup as 2 separate sets of pages, one per TA, with your name and userid on each set.

1 Bayes Net [Carl, 40 points]

Figure 1: Graphical model for question 1.1.

1. In this problem, you will study the behavior of the simple graphical model shown in Figure 1, often called a "collider". Here, the probability of Z depends on two variables X and Y.

(a) Write the factored distribution represented by this graph. That is, write the distribution as a product of factors, each of which could be replaced by entries in the conditional probability tables (CPTs) in order to obtain the probability of an event. [4 points]

(b) Assume that X, Y, and Z are binary. Write the CPTs for a distribution such that
$$P(X = 1) < P(X = 1 \mid Y = 1, Z = 1) < P(X = 1 \mid Z = 1).$$
Make sure to include in your solution the values of $P(X = 1 \mid Y = 1, Z = 1)$ and $P(X = 1 \mid Z = 1)$. [8 points]

Figure 2: Graphical model for question 1.2.

2. Let $P(A, B, C, D, E, F, G, H)$ factorize according to the graph given in Figure 2. Prove or disprove whether the following independence properties hold. [3 points each]

(a) $(A \perp H)$
(b) $(B \perp G)$
(c) $(B \perp G \mid E)$
(d) $(C \perp D \mid G, A)$
(e) $(C \perp D)$
(f) $(B \perp D \mid C)$

Figure 3: Graphical model for question 1.3.

3. Let $X = \{X_1, \ldots, X_n\}$ have a distribution $P(X)$ that factorizes according to the graph given in Figure 3, and assume each $X_i$ is binary. Design a polynomial-time algorithm which will compute the following quantities. Use pseudocode for your answer. In your pseudocode, assume that the variables px1 and px already exist, where px1 is the prior probability $P(X_1 = 1)$ and px is an $(n-1) \times 2$ array in which px[i][j] is $P(X_{i+1} = 1 \mid X_i = j - 1)$ for $j \in \{1, 2\}$. For this problem, you may assume that the numbers have arbitrary precision, so you do not need to worry about underflow. (A short sketch of how px1 and px index the CPT entries follows this question.)

(a) Compute $P(X_n = 1 \mid X_1 = x_1)$. Assume that x1 has already been initialized with the value of $X_1 \in \{0, 1\}$. [5 points]

(b) Compute $P(X_1 = 1 \mid X_n = x_n)$. Assume that xn has already been initialized with the value of $X_n \in \{0, 1\}$. [5 points]
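As a concreteness check on the data layout above (and not a solution to (a) or (b)), the following minimal MATLAB sketch evaluates the joint probability of one full 0/1 assignment x under the chain factorization $P(X) = P(X_1) \prod_{i=1}^{n-1} P(X_{i+1} \mid X_i)$. The function name chain_joint and the argument x are hypothetical; px1 and px follow the conventions defined in the question.

    % Joint probability P(X1 = x(1), ..., Xn = x(n)) for a 0/1 vector x.
    % px1 is P(X1 = 1); px(i, j) is P(X_{i+1} = 1 | X_i = j - 1).
    function p = chain_joint(x, px1, px)
        if x(1) == 1
            p = px1;                 % P(X1 = 1)
        else
            p = 1 - px1;             % P(X1 = 0)
        end
        for i = 1:(length(x) - 1)
            q = px(i, x(i) + 1);     % value v of X_i selects column v + 1
            if x(i + 1) == 1
                p = p * q;
            else
                p = p * (1 - q);
            end
        end
    end

The conditionals asked for in (a) and (b) require summing over the unobserved variables rather than fixing them all, which is where the polynomial-time recursion comes in; the sketch only pins down how the CPT entries are indexed.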
2 Multi-class Logistic Regression with Applications to Handwritten Digit Recognition [Xi, 60 points]

Recall the multi-class logistic regression model on page 6 of the lecture on Jan 27th. Assume that we have K different classes and each input x is a d-dimensional vector. The posterior probability is given by

$$P(Y = k \mid X = x) = \frac{\exp(w_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(w_l^T x)} \quad \text{for } k = 1, \ldots, K - 1,$$

$$P(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(w_l^T x)},$$

where $w_k^T$ denotes the transpose of the vector $w_k$ and $w_k^T x$ is the inner product of $w_k$ and $x$ (i.e., $w_k^T x = \sum_i w_{ki} x_i$). To simplify, we ignore the constant term $w_{k0}$ in the logistic regression model.

For ease of notation, we can replace the two expressions above for $P(Y \mid X)$ by a single one. We do this by introducing a fixed pseudo-parameter vector $w_K = 0$. Now for any class label k, we simply write

$$P(Y = k \mid X = x) = \frac{\exp(w_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(w_l^T x)} \quad \text{for } k = 1, \ldots, K.$$

1. How many parameters do we need to estimate? What are these parameters? [5pt]

2. Given n training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, please write down explicitly the log-likelihood function and simplify it as much as you can:
$$L(w_1, \ldots, w_{K-1}) = \sum_{i=1}^{n} \ln P(Y = y_i \mid X = x_i).$$
[5pt]

3. Compute the gradient of L with respect to each $w_k$ and simplify it. [5pt]

4. Now we add the regularization term $\frac{\lambda}{2} \sum_{l=1}^{K-1} \|w_l\|_2^2$ and define a new objective function:
$$f(w_1, \ldots, w_{K-1}) = L(w_1, \ldots, w_{K-1}) - \frac{\lambda}{2} \sum_{l=1}^{K-1} \|w_l\|_2^2.$$
Compute the gradient of f with respect to each $w_k$. [5pt]

5. In this part, you are going to play with the "handwritten digit recognition" problem on the USPS digit dataset. Each handwritten digit image is 16 by 16 pixels. If we treat the value of each pixel as a boolean feature (either 0 for black or 1 for white), then each example has 16 × 16 = 256 {0, 1}-valued features, and hence x has 256 dimensions. Each digit (i.e., 1, 2, 3, 4, 5, 6, 7, 8, 9, 0) corresponds to a class label y (y = 1, ..., K, with K = 10). For each digit, we have 600 training samples and 500 testing samples. You can view these images at http://www.cs.nyu.edu/~roweis/data/usps_0.jpg, . . ., http://www.cs.nyu.edu/~roweis/data/usps_9.jpg.

Please download the data from the website. Load the usps_digital.mat file in usps_digital.zip into MATLAB. You will have four matrices:

• tr_X: training input matrix of dimension 6000 × 256.
• tr_y: training labels of length 6000; each element is from 1 to 10.
• te_X: testing input matrix of dimension 5000 × 256.
• te_y: testing labels of length 5000; each element is from 1 to 10.

For those who do NOT want to use MATLAB, we also provide text files for these four matrices in usps_digital.zip.

Note that if you want to view the image of a particular training/testing example in MATLAB, say the 1000th training example, you may use the MATLAB command imshow(reshape(tr_X(1000,:), 16, 16)).

(a) [15pt] Use the gradient ascent algorithm to train a multi-class logistic regression classifier (a skeletal training loop is sketched after this question). Plot (1) the objective value (log-likelihood), (2) the training accuracy, and (3) the testing accuracy versus the number of iterations. Report your final testing accuracy, i.e., the fraction of test images that are correctly classified.

Note that you must choose a suitable learning rate (i.e., step size) for the gradient ascent algorithm. A hint is that your learning rate cannot be too large; otherwise your objective will increase only for the first few iterations.

In addition, you need to choose a suitable stopping criterion. You might use the number of iterations, the decrease of the objective value, or the maximum of the L2 norms of the gradients with respect to each $w_k$. Or you might watch the increase of the testing accuracy and stop the optimization when the accuracy is stable.

(b) [20pt] Now we add the regularization term $\frac{\lambda}{2} \sum_{l=1}^{K-1} \|w_l\|_2^2$. For λ = 1, 10, 100, 1000, report the final testing accuracies.

(c) [5pt] What can you conclude from the above experiment? (Hint: the relationship between the regularization weight and the prediction accuracy.)
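The following is a minimal MATLAB sketch of the gradient-ascent loop referenced in part (a), written from the gradients you are asked to derive in questions 3 and 4 (set lambda = 0 for the unregularized run). The function name train_mlr and the hyperparameters eta (learning rate), lambda, and T (iteration count) are hypothetical; bsxfun is used so the sketch also runs on older MATLAB releases.

    % Gradient ascent for the objective f above. Columns of W are w_1, ..., w_{K-1};
    % w_K is fixed to the zero vector, so its score is always 0.
    function W = train_mlr(tr_X, tr_y, K, eta, lambda, T)
        [n, d] = size(tr_X);
        W = zeros(d, K - 1);
        Y = full(sparse(1:n, tr_y(:)', 1, n, K));   % n-by-K one-hot labels
        for t = 1:T
            S = [tr_X * W, zeros(n, 1)];            % scores w_k' * x_i, incl. w_K = 0
            S = bsxfun(@minus, S, max(S, [], 2));   % shift each row; softmax is unchanged
            E = exp(S);
            P = bsxfun(@rdivide, E, sum(E, 2));     % P(i, k) = P(Y = k | X = x_i)
            % grad of f w.r.t. w_k: sum_i x_i (1{y_i = k} - P(k | x_i)) - lambda * w_k
            G = tr_X' * (Y(:, 1:K-1) - P(:, 1:K-1)) - lambda * W;
            W = W + eta * G;                        % ascent step
        end
    end

The per-iteration objective and accuracies needed for the plots in (a) can be recorded inside the loop; the row-max shift before exponentiating is optional here but cheap insurance against overflow.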
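Since all class posteriors share the same denominator, the predicted class is simply the arg-max of the scores. A sketch for the testing-accuracy reports, assuming a weight matrix W laid out as above:

    % Fraction of test images correctly classified.
    S = [te_X * W, zeros(size(te_X, 1), 1)];   % append the zero score of class K
    [~, yhat] = max(S, [], 2);                 % predicted label = argmax_k w_k' * x
    acc = mean(yhat(:) == te_y(:));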

