10-701/15-781 Machine Learning, Spring 2007: Homework 1
Due: Wednesday, February 7, beginning of the class

Instructions

There are 4 questions on this assignment. The last question involves coding. Do not attach your code to the writeup. Instead, copy your implementation to

/afs/andrew.cmu.edu/course/10/701/Submit/your_andrew_id/HW1

Refer to the webpage for policies regarding collaboration, due dates, and extensions. To write in this directory, you need a kerberos instance for andrew, or you can log into, for example, unix.andrew.cmu.edu.

1 Decision Trees [Andy, 20 points]

1.1 Pruning Decision Trees with Chi-Square Tests

One method of pruning decision trees is to test the data involved in each split in the tree, to see how likely it is that we would see the observed data if the attribute being split and the target attribute were uncorrelated. A commonly used test is Pearson's chi-square test, which is an example of a statistical hypothesis test. In this test, we hypothesize that the two attributes are uncorrelated, and test whether the observed evidence strongly supports rejecting this hypothesis. Specifically, we calculate a certain test statistic which has a chi-squared distribution if the data are uncorrelated, and reject the hypothesis that the variables are uncorrelated if the statistic has a value unlikely to have been generated by a chi-squared distribution.

To perform the test, assume we have data for two variables A and B, and we want to test whether A and B are correlated. Assume A can take on j different values, while B can take on k different values. Using the data, we calculate the test statistic χ^2:

χ^2 = Σ_{a=1}^{j} Σ_{b=1}^{k} [Count(a, b) − n · p(a)p(b)]^2 / (n · p(a)p(b))

Here Count(a, b) is the number of data points with values A = a and B = b, and n is the total number of data points.
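As a quick illustration (not part of the assignment), the statistic above can be computed directly from a j-by-k table of counts; the helper name and the 2x2 table below are made up for this sketch:

```python
# Sketch: Pearson's chi-square statistic from a j-by-k table of counts.
# The function name and the example counts are hypothetical.
def chi_square(counts):
    j, k = len(counts), len(counts[0])
    n = sum(sum(row) for row in counts)
    # Marginal probabilities p(a) and p(b) estimated from the data.
    p_a = [sum(counts[a]) / n for a in range(j)]
    p_b = [sum(counts[a][b] for a in range(j)) / n for b in range(k)]
    stat = 0.0
    for a in range(j):
        for b in range(k):
            expected = n * p_a[a] * p_b[b]
            stat += (counts[a][b] - expected) ** 2 / expected
    return stat

# 2x2 example: with 1 degree of freedom, compare the statistic against
# the critical value 2.706 (90% confidence).
print(chi_square([[30, 10], [10, 30]]) > 2.706)  # prints True
```

Here the statistic comes out to 20.0, well above 2.706, so for this fabricated table we would reject the hypothesis of independence.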
p(a) is the fraction of data points with A = a, and p(b) is the fraction of data points with B = b.

A parameter to the chi-square distribution is the degrees of freedom, which, if A can take on j different values while B can take on k different values, is:

DOF = (j − 1)(k − 1)

For example, if A and B can each take two different values, then DOF = (2 − 1)(2 − 1) = 1.

We compute the value of χ^2 for the data, and compare it to a critical value, which is based on the degrees of freedom and the degree of error we are willing to accept. If the statistic exceeds the critical value, then we can reject the hypothesis of independence. For example, with 1 degree of freedom, the statistic must exceed 2.706 to be 90% sure that the variables are correlated.

Consider the following set of training examples for the unknown target function <X1, X2> → Y. Each row indicates the values observed, and how many times that set of values was observed. For example, (+, T, T) was observed 3 times, while (−, T, T) was never observed.

Y  X1  X2  Count
+  T   T   3
+  T   F   4
+  F   T   4
+  F   F   1
-  T   T   0
-  T   F   1
-  F   T   3
-  F   F   5

1. [3 pts] What is the sample entropy H(Y) for this training data?

2. [3 pts] What is the information gain IG(X2) ≡ H(Y) − H(Y | X2) for this sample of training data?

3. [4 pts] Draw the decision tree that would be learned by ID3 (without postpruning) from this sample of training data. Show your information gain calculations.

4. [5 pts] Are the splits in the decision tree from the previous part statistically significant? That is, does the chi-square test with maxPChance = 0.10 support the conclusion that the attributes in the splits are correlated? (Hint: the critical value for the statistic will be 2.706.)

1.2 KL-Divergence, Information Gain, and Entropy

When we discussed learning decision trees in class, we chose the next attribute to split on by choosing the one with maximum information gain, which was defined in terms of entropy.
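As an aside (not part of the assignment), the entropy-based and KL-based definitions of information gain can both be computed from a joint distribution and checked against each other numerically; the function names and the joint table below are made up for this sketch:

```python
import math

# Sketch: information gain computed two ways from a hypothetical joint
# distribution, with joint[x][y] = p(X = x, Y = y). All numbers are
# made up for illustration; function names are hypothetical.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def ig_entropy(joint):
    # IG = H(Y) - H(Y | X).
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_y_given_x = sum(px * entropy([pxy / px for pxy in row])
                      for px, row in zip(p_x, joint) if px > 0)
    return entropy(p_y) - h_y_given_x

def ig_kl(joint):
    # IG = KL( p(x, y) || p(x) p(y) ).
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (p_x[x] * p_y[y]))
               for x, row in enumerate(joint)
               for y, pxy in enumerate(row) if pxy > 0)

joint = [[0.4, 0.1], [0.1, 0.4]]
print(ig_entropy(joint), ig_kl(joint))  # the two definitions agree
```

For this toy distribution both definitions give the same value (about 0.278 bits), which is the equivalence Question 1 of this section asks you to prove in general.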
To further our understanding of information gain, we will explore its connection to KL-divergence, an important concept in information theory and machine learning. For more on these concepts, refer to Section 1.6 in Bishop.

The KL-divergence from a distribution p(x) to a distribution q(x) can be thought of as a distance measure from p to q:

KL(p || q) = − Σ_x p(x) log2 ( q(x) / p(x) )

From an information theory perspective, the KL-divergence specifies the number of additional bits required on average to transmit values of x if the values are distributed with respect to p(x) but we encode them assuming the distribution q(x). If p(x) = q(x), then KL(p || q) = 0. Otherwise, KL(p || q) > 0. The smaller the KL-divergence, the more similar the two distributions.

We can define information gain as the KL-divergence from the observed joint distribution of X and Y to the distribution obtained if we assume independence:

IG(x, y) ≡ KL( p(x, y) || p(x)p(y) ) = − Σ_x Σ_y p(x, y) log2 ( p(x)p(y) / p(x, y) )

When the information gain is high, it indicates that adding a split to the decision tree will give a more accurate model.

1. [5 pts] Show that this definition of information gain is equivalent to the one given in class. That is, show that IG(x, y) = H[x] − H[x|y] = H[y] − H[y|x], starting from the definition in terms of KL-divergence.

2 Regression [Brian, 25 points]

2.1 Linear Models

A billionaire near Seattle is trying to estimate the gross sales of different software packages. He has a few different functions that he would like to try to use for his model.

Unfortunately, he only has at his disposal a software package for linear regression. The linear regression package takes as input a vector of responses (Y) and a matrix of features (X), where the entry X_{i,j} corresponds to the i-th example and the j-th input for that example, and Y_i is the i-th response of the function. The linear regression package returns a vector of weights w that minimizes the sum of squared residual errors.
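As a quick illustration (not part of the assignment), NumPy's least-squares routine can stand in for such a package; the feature matrix and responses below are made up for this sketch:

```python
import numpy as np

# Sketch: what a least-squares linear-regression package returns, i.e.
# the weights w minimizing ||Y - X w||^2. NumPy's lstsq stands in for
# the billionaire's package; all data here are hypothetical.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])     # column 0 is a constant (intercept) feature
Y = np.array([1.0, 3.0, 5.0])  # lies exactly on Y = 1 + 2 * X[:, 1]

w, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
print(w)  # w[j] is the weight applied to the j-th feature
```

Because the toy data lie exactly on a line, the returned weights recover the intercept 1 and slope 2; with noisy data, lstsq would instead return the least-squares fit.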
The j-th entry of the vector, w_j, is the weight applied to the j-th feature.

The billionaire has a gross sales amount G_i and a vector of characteristics C_i for each piece of software his company has ever produced. Your job is to help him find the maximum likelihood parameter estimates for the functions he wants to employ, or tell him when the ML estimate cannot be obtained using his linear regression software.

Your answer should:

• specify how the response and inputs (Y_i and X_{i,j}) are calculated for the regression software package

• specify how the parameters α can be obtained from the values w returned by the regression software package, so that α is the maximum likelihood estimate

OR:

• provide your reasoning for why the software cannot be employed

Example. To help you understand how to use the software, the