CMU CS 10701 - Homework


10-701/15-781 Machine Learning, Spring 2007: Homework 1

Due Wednesday, February 7, at the beginning of class.

Instructions: There are 4 questions on this assignment. The last question involves coding. Do not attach your code to the writeup; instead, copy your implementation to /afs/andrew.cmu.edu/course/10/701/Submit/your_andrew_id/HW1. Refer to the webpage for policies regarding collaboration, due dates, and extensions. To write in this directory you need a Kerberos instance for Andrew, or you can log into, for example, unix.andrew.cmu.edu.

1 Decision Trees [Andy, 20 points]

1.1 Pruning Decision Trees with Chi-Square Tests

One method of pruning decision trees is to test the data involved in each split of the tree, to see how likely it is that we would see the observed data if the attribute being split on and the target attribute were uncorrelated. A commonly used test is Pearson's chi-square test, which is an example of a statistical hypothesis test. In this test, we hypothesize that the two attributes are uncorrelated, and test whether the observed evidence strongly supports rejecting this hypothesis. Specifically, we calculate a certain test statistic, which has a chi-squared distribution if the data are uncorrelated, and reject the hypothesis that the variables are uncorrelated if the statistic has a value unlikely to have been generated by a chi-squared distribution.

To perform the test, assume we have data for two variables A and B, and we want to test whether A and B are correlated. Assume A can take on j different values while B can take on k different values. Using the data, we calculate the test statistic

\[
\chi^2 = \sum_{a=1}^{j} \sum_{b=1}^{k} \frac{\big(\mathrm{Count}(a,b) - n\,p(a)\,p(b)\big)^2}{n\,p(a)\,p(b)}
\]

Here, Count(a, b) is the number of data points with values A = a and B = b, n is the total number of data points, p(a) is the fraction of data points with A = a, and p(b) is the fraction of data points with B = b. A parameter to the chi-squared distribution is the degrees of freedom, which, if A can take on j different values while B can take on k different values, is DOF = (j - 1)(k - 1).
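As an aside (not part of the assignment), the chi-square statistic described above can be sketched in a few lines of Python. The table below is hypothetical data, chosen only to illustrate the computation:

```python
def chi_square_statistic(counts):
    """Pearson's chi-square statistic for a j-by-k contingency table,
    where counts[a][b] = Count(a, b):
        chi2 = sum_{a,b} (Count(a,b) - n*p(a)*p(b))^2 / (n*p(a)*p(b))
    """
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(row[b] for row in counts) for b in range(len(counts[0]))]
    chi2 = 0.0
    for a, row in enumerate(counts):
        for b, observed in enumerate(row):
            # Expected count under independence: n * p(a) * p(b).
            expected = n * (row_tot[a] / n) * (col_tot[b] / n)
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table (A in {T, F} rows, B in {T, F} columns).
table = [[30, 10],
         [10, 30]]
chi2 = chi_square_statistic(table)
dof = (2 - 1) * (2 - 1)
# With 1 degree of freedom, reject independence at the 90% level
# if the statistic exceeds 2.706.
reject = chi2 > 2.706
```

With a strongly diagonal table like this one, the statistic comfortably exceeds the critical value, so the independence hypothesis would be rejected.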
For example, if A and B can each take two different values, then DOF = (2 - 1)(2 - 1) = 1. We compute the value of \(\chi^2\) for the data and compare it to a critical value, which is based on the degrees of freedom and the degree of error we are willing to accept. If the statistic exceeds the critical value, then we can reject the hypothesis of independence. For example, with 1 degree of freedom, the statistic must exceed 2.706 for us to be 90% sure that the variables are correlated.

Consider the following set of training examples for the unknown target function (X1, X2) -> Y. Each row indicates the values observed and how many times that set of values was observed. For example, (+, T, T) was observed 3 times, while (-, T, T) was never observed.

  Y  | X1 | X2 | Count
  +  | T  | T  | 3
  +  | T  | F  | 4
  +  | F  | T  | 4
  +  | F  | F  | 1
  -  | T  | T  | 0
  -  | T  | F  | 1
  -  | F  | T  | 3
  -  | F  | F  | 5

1. [3 pts] What is the sample entropy H(Y) for this training data?

2. [3 pts] What is the information gain IG(X2) = H(Y) - H(Y | X2) for this sample of training data?

3. [4 pts] Draw the decision tree that would be learned by ID3 (without postpruning) from this sample of training data. Show your information gain calculations.

4. [5 pts] Are the splits in the decision tree from the previous part statistically significant? That is, does the chi-square test with maxPChance = 0.10 support the conclusion that the attributes in the splits are correlated? (Hint: the critical value for the statistic will be 2.706.)

1.2 KL-Divergence, Information Gain, and Entropy

When we discussed learning decision trees in class, we chose the next attribute to split on by choosing the one with maximum information gain, which was defined in terms of entropy. To further our understanding of information gain, we will explore its connection to KL-divergence, an important concept in information theory and machine learning. For more on these concepts, refer to Section 1.6 in Bishop.

The KL-divergence from a distribution p(x) to a distribution q(x) can be thought of as a distance measure from p to q:

\[
KL(p\|q) = -\sum_x p(x) \log_2 \frac{q(x)}{p(x)}
\]

From an information theory perspective, the KL-divergence specifies the
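As a sketch of the entropy and information-gain computations involved here (a starting point, not a worked answer), the quantities H(Y) and IG(X2) can be computed directly from a table of (Y, X1, X2) counts like the one above:

```python
import math

def entropy(counts):
    """Entropy in bits of the distribution given by class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Training counts keyed by (Y, X1, X2), as in the table above.
data = {('+', 'T', 'T'): 3, ('+', 'T', 'F'): 4,
        ('+', 'F', 'T'): 4, ('+', 'F', 'F'): 1,
        ('-', 'T', 'T'): 0, ('-', 'T', 'F'): 1,
        ('-', 'F', 'T'): 3, ('-', 'F', 'F'): 5}

def y_counts(table):
    """Counts of Y = + and Y = - in a sub-table."""
    return [sum(c for (y, _, _), c in table.items() if y == lab)
            for lab in ('+', '-')]

n = sum(data.values())
h_y = entropy(y_counts(data))  # sample entropy H(Y)

# Conditional entropy H(Y | X2) = sum_v P(X2 = v) * H(Y | X2 = v).
h_y_given_x2 = 0.0
for v in ('T', 'F'):
    subset = {k: c for k, c in data.items() if k[2] == v}
    h_y_given_x2 += sum(subset.values()) / n * entropy(y_counts(subset))

ig_x2 = h_y - h_y_given_x2  # information gain IG(X2)
```

The same `entropy` helper can be reused for the ID3 question by conditioning on X1 instead of X2 and picking the split with the larger gain.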
number of additional bits required, on average, to transmit values of x if the values are distributed with respect to p(x) but we encode them assuming the distribution q(x). If p(x) = q(x), then KL(p||q) = 0. Otherwise, KL(p||q) > 0. The smaller the KL-divergence, the more similar the two distributions.

We can define information gain as the KL-divergence from the observed joint distribution of X and Y to the distribution obtained if we assume independence:

\[
IG(x,y) = KL\big(p(x,y)\,\|\,p(x)p(y)\big) = -\sum_x \sum_y p(x,y) \log_2 \frac{p(x)\,p(y)}{p(x,y)}
\]

When the information gain is high, it indicates that adding a split to the decision tree will give a more accurate model.

1. [5 pts] Show that this definition of information gain is equivalent to the one given in class. That is, show that IG(x,y) = H[x] - H[x|y] = H[y] - H[y|x], starting from the definition in terms of KL-divergence.

2 Regression [Brian, 25 points]

2.1 Linear Models

A billionaire near Seattle is trying to estimate the gross sales of different software packages. He has a few different functions that he would like to try to use for his model. Unfortunately, he only has at his disposal a software package for linear regression. The linear regression package takes as input a vector of responses Y and a matrix of features X, where the entry X_{i,j} corresponds to the i-th example and the j-th input for that example, and Y_i is the i-th response of the function. The linear regression package returns a vector of weights w that minimizes the sum of squared residual errors. The j-th entry of the vector, w_j, is the weight applied to the j-th feature.

The billionaire has a gross sales amount G_i and a vector of characteristics C_i for each piece of software his company has ever produced. Your job is to help him find the maximum likelihood parameter estimates for the functions he wants to employ, or tell him when the ML estimate cannot be obtained using his linear regression software. Your answer should specify how the response and inputs Y_i and X_{i,j} are calculated for the regression software package, specify how
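A minimal numeric sketch of the KL-divergence as defined above, on two hypothetical three-outcome distributions (the specific values are illustrative only):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = -sum_x p(x) * log2(q(x)/p(x)), using the definition
    in the text. Assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(qx / px) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

d_pq = kl_divergence(p, q)  # positive, since p != q
d_pp = kl_divergence(p, p)  # zero, since the distributions are identical
```

Note that KL-divergence is not symmetric: KL(p||q) and KL(q||p) generally differ, which is why the text is careful to call it a distance measure "from p to q" rather than a metric.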
parameters can be obtained from the values returned by the regression software package (w) so that it is the maximum likelihood estimate, OR provide your reasoning for why the …
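The general trick this question exercises can be sketched as follows: a model that is nonlinear in the input but linear in the weights can be fit with a plain linear-regression solver by precomputing the transformed inputs as features. The model G = w1*c + w2*c^2 and the data below are hypothetical, chosen only to illustrate the feature construction, and the two-feature solver stands in for the billionaire's package:

```python
def least_squares(X, Y):
    """Minimize the sum of squared residuals for a two-feature design
    matrix by solving the 2x2 normal equations (X^T X) w = X^T Y."""
    s11 = sum(x[0] * x[0] for x in X)
    s12 = sum(x[0] * x[1] for x in X)
    s22 = sum(x[1] * x[1] for x in X)
    t1 = sum(x[0] * y for x, y in zip(X, Y))
    t2 = sum(x[1] * y for x, y in zip(X, Y))
    det = s11 * s22 - s12 * s12
    w1 = (s22 * t1 - s12 * t2) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2

# Characteristics c_i and noise-free gross sales G_i = 2*c + 0.5*c^2.
C = [1.0, 2.0, 3.0, 4.0]
X = [[c, c * c] for c in C]           # inputs X_{i,j} handed to the package
Y = [2 * c + 0.5 * c * c for c in C]  # responses Y_i
w1, w2 = least_squares(X, Y)          # weights returned by the package
```

Here the weights come back from the solver directly as the model parameters; for other functional forms in the question, recovering the parameters from w may require an extra transformation, or may not be possible at all.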

