10-701/15-781 Machine Learning, Spring 2007
Homework 1
Due: Wednesday, February 7, at the beginning of class.

Instructions: There are 4 questions on this assignment. The last question involves coding. Do not attach your code to the writeup; instead, copy your implementation to /afs/andrew.cmu.edu/course/10/701/Submit/<your andrew id>/HW1. Refer to the webpage for policies regarding collaboration, due dates, and extensions. To write in this directory you need a kerberos instance for andrew, or you can log into, for example, unix.andrew.cmu.edu.

1 Decision Trees [Andy, 20 points]

1.1 Pruning Decision Trees with Chi-Square Tests

One method of pruning decision trees is to test the data involved in each split in the tree, to see how likely it is that we would see the observed data if the attribute being split and the target attribute were uncorrelated. A commonly used test is Pearson's chi-square test, which is an example of a statistical hypothesis test. In this test we hypothesize that the two attributes are uncorrelated, and test whether the observed evidence strongly supports rejecting this hypothesis. Specifically, we calculate a certain test statistic, which has a chi-square distribution if the data are uncorrelated, and reject the hypothesis that the variables are uncorrelated if the statistic has a value unlikely to have been generated by a chi-square distribution.

To perform the test, assume we have data for two variables A and B, and we want to test whether A and B are correlated. Assume A can take on j different values while B can take on k different values. Using the data, we calculate the test statistic

\[ \chi^2 = \sum_{a=1}^{j} \sum_{b=1}^{k} \frac{\big(\mathrm{Count}(a,b) - n\,p(a)\,p(b)\big)^2}{n\,p(a)\,p(b)} \]

Here Count(a, b) is the number of data points with values A = a and B = b, n is the total number of data points, p(a) is the fraction of data points with A = a, and p(b) is the fraction of data points with B = b.

A parameter to the chi-square distribution is the degrees of freedom, which, if A can take on j different values while B can take on k different values, is DOF = (j - 1)(k - 1). For example, if A and B can each take two different values, then DOF = (2 - 1)(2 - 1) = 1.

We compute the value of \chi^2 for the data and compare it to a critical value, which is based on the degrees of freedom and the degree of error we are willing to accept. If the statistic exceeds the critical value, then we can reject the hypothesis of independence. For example, with 1 degree of freedom, the statistic must exceed 2.706 for us to be 90% sure that the variables are correlated.

Consider the following set of training examples for the unknown target function Y = f(X1, X2). Each row indicates the values observed, and how many times that set of values was observed. For example, (+, T, T) was observed 3 times, while (-, T, T) was never observed.

Y    X1   X2   Count
+    T    T    3
+    T    F    4
+    F    T    4
+    F    F    1
-    T    T    0
-    T    F    1
-    F    T    3
-    F    F    5

1. (3 pts) What is the sample entropy H(Y) for this training data?

2. (3 pts) What is the information gain IG(X2) = H(Y) - H(Y | X2) for this sample of training data?

3. (4 pts) Draw the decision tree that would be learned by ID3 (without postpruning) from this sample of training data. Show your information gain calculations.

4. (5 pts) Are the splits in the decision tree from the previous part statistically significant? That is, does the chi-square test with MaxPchance = 0.10 support the conclusion that the attributes in the splits are correlated? (Hint: the critical value for the statistic will be 2.706.)
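To make the test above concrete, here is a minimal Python sketch (not part of the assignment) that computes the chi-square statistic for a table of counts over two discrete attributes; the function name and the toy table are illustrative only.

    from itertools import product

    def chi_square_statistic(counts):
        """Pearson chi-square statistic for a dict mapping (a, b) -> Count(a, b)."""
        n = sum(counts.values())
        a_vals = sorted({a for a, _ in counts})
        b_vals = sorted({b for _, b in counts})
        # Marginal fractions p(a) and p(b), estimated from the data.
        p_a = {a: sum(counts.get((a, b), 0) for b in b_vals) / n for a in a_vals}
        p_b = {b: sum(counts.get((a, b), 0) for a in a_vals) / n for b in b_vals}
        stat = 0.0
        for a, b in product(a_vals, b_vals):
            expected = n * p_a[a] * p_b[b]      # expected count under independence
            observed = counts.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
        return stat

    # Toy 2x2 table, so DOF = (2-1)(2-1) = 1; compare the result against the
    # 90% critical value of 2.706 quoted above.
    toy = {("T", "T"): 30, ("T", "F"): 10, ("F", "T"): 12, ("F", "F"): 28}
    print(chi_square_statistic(toy))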
1.2 KL Divergence, Information Gain, and Entropy

When we discussed learning decision trees in class, we chose the next attribute to split on by choosing the one with maximum information gain, which was defined in terms of entropy. To further our understanding of information gain, we will explore its connection to KL divergence, an important concept in information theory and machine learning. For more on these concepts, refer to Section 1.6 in Bishop.

The KL divergence from a distribution p(x) to a distribution q(x) can be thought of as a distance measure from p to q:

\[ KL(p \,\|\, q) = -\sum_x p(x) \log_2 \frac{q(x)}{p(x)} \]

From an information theory perspective, the KL divergence specifies the number of additional bits required on average to transmit values of x, if the values are distributed with respect to p(x) but we encode them assuming the distribution q(x). If p(x) = q(x), then KL(p||q) = 0; otherwise KL(p||q) > 0. The smaller the KL divergence, the more similar the two distributions.

We can define information gain as the KL divergence from the observed joint distribution of X and Y to the distribution obtained by assuming independence:

\[ IG(x, y) = KL\big(p(x,y) \,\|\, p(x)\,p(y)\big) = -\sum_x \sum_y p(x,y)\,\log_2 \frac{p(x)\,p(y)}{p(x,y)} \]

When the information gain is high, it indicates that adding a split to the decision tree will give a more accurate model.

1. (5 pts) Show that this definition of information gain is equivalent to the one given in class. That is, show that IG(x, y) = H(x) - H(x | y) = H(y) - H(y | x), starting from the definition in terms of KL divergence.
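The identity that question 1 asks you to prove can be spot-checked numerically before attempting the algebra. The sketch below is not a substitute for the derivation; it simply evaluates both sides on an arbitrary small joint distribution, and all numbers are made up for illustration.

    import numpy as np

    # An arbitrary 2x2 joint distribution p(x, y); rows index x, columns index y.
    p_xy = np.array([[0.30, 0.10],
                     [0.20, 0.40]])
    p_x = p_xy.sum(axis=1)   # marginal p(x)
    p_y = p_xy.sum(axis=0)   # marginal p(y)

    # Information gain written as a KL divergence: KL(p(x,y) || p(x)p(y)).
    ig_as_kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

    # Information gain as defined in class: H(Y) - H(Y|X),
    # using p(y|x) = p(x,y) / p(x).
    h_y = -np.sum(p_y * np.log2(p_y))
    h_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))
    print(ig_as_kl, h_y - h_y_given_x)   # the two values agree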
2 Regression [Brian, 25 points]

2.1 Linear Models

A billionaire near Seattle is trying to estimate the gross sales of different software packages. He has a few different functions that he would like to try to use for his model. Unfortunately, he only has at his disposal a software package for linear regression. The linear regression package takes as input a vector of responses Y and a matrix of features X, where the entry X_{i,j} corresponds to the i-th example and the j-th input for that example, and Y_i is the i-th response of the function. The linear regression package returns a vector of weights w that minimizes the sum of squared residual errors; the j-th entry of the vector, w_j, is the weight applied to the j-th feature.

The billionaire has a gross sales amount G_i and a vector of characteristics C_i for each piece of software his company has ever produced. Your job is to help him find the maximum likelihood parameter estimates for the functions he wants to employ, or to tell him when the ML estimate cannot be obtained using his linear regression software. Your answer should:

- specify how the responses and inputs, Y_i and X_{i,j}, are calculated for the regression software package;
- specify how the parameters can be obtained from the values w returned by the regression software package, so that the result is the maximum likelihood estimate;
- OR provide your reasoning for why the …
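To illustrate the kind of reduction the problem is asking for, here is a minimal sketch of fitting one hypothetical model that is nonlinear in the raw characteristic but linear in the parameters, using an off-the-shelf least-squares routine in place of the billionaire's package. The quadratic model and all numbers are invented for illustration; they are not one of the functions from the assignment.

    import numpy as np

    # Hypothetical model: G_i = w0 + w1 * c_i + w2 * c_i**2 + noise, where c_i is a
    # single characteristic of software package i.  The model is nonlinear in c_i
    # but linear in the parameters, so a linear regression package still applies.
    c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # example characteristics C_i
    G = np.array([2.1, 5.2, 10.3, 17.1, 26.4])   # example gross sales (responses Y_i)

    # Feature matrix X: entry X[i, j] is the j-th input for the i-th example.
    X = np.column_stack([np.ones_like(c), c, c ** 2])

    # The returned weights are the least-squares estimates of (w0, w1, w2); under
    # the usual Gaussian-noise assumption these are also the ML estimates.
    w, *_ = np.linalg.lstsq(X, G, rcond=None)
    print(w)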