10701/15781 Machine Learning, Spring 2007: Homework 1

Due: Wednesday, February 7, at the beginning of class

Instructions

There are 4 questions on this assignment. The last question involves coding. Do not attach your code to the writeup; instead, copy your implementation to

    /afs/andrew.cmu.edu/course/10/701/Submit/your_andrew_id/HW1

Refer to the webpage for policies regarding collaboration, due dates, and extensions. To write to this directory, you need a kerberos instance for andrew, or you can log into, for example, unix.andrew.cmu.edu.

1 Decision Trees [Andy, 20 points]

1.1 Pruning Decision Trees with Chi-Square Tests

One method of pruning decision trees is to test the data involved in each split in the tree, to see how likely it is that we would see the observed data if the attribute being split on and the target attribute were uncorrelated. A commonly used test is Pearson's chi-square test, an example of a statistical hypothesis test: we hypothesize that the two attributes are uncorrelated, and test whether the observed evidence strongly supports rejecting this hypothesis. Specifically, we calculate a test statistic that has a chi-squared distribution if the data are uncorrelated, and we reject the hypothesis if the statistic takes a value unlikely to have been generated by a chi-squared distribution.

To perform the test, assume we have data for two variables A and B, and we want to test whether A and B are correlated. Assume A can take on j different values, while B can take on k different values. Using the data, we calculate the test statistic

    \chi^2 = \sum_{a=1}^{j} \sum_{b=1}^{k} \frac{[\mathrm{Count}(a, b) - n \, p(a) \, p(b)]^2}{n \, p(a) \, p(b)}

Here Count(a, b) is the number of data points with values A = a and B = b, n is the total number of data points, p(a) is the fraction of data points with A = a, and p(b) is the fraction of data points with B = b.

A parameter of the chi-square distribution is the degrees of freedom which, if A can take on j different values and B can take on k different values, is

    DOF = (j - 1)(k - 1)

For example, if A and B can each take two different values, then DOF = (2 - 1)(2 - 1) = 1.

We compute the value of \chi^2 for the data and compare it to a critical value, which depends on the degrees of freedom and the degree of error we are willing to accept. If the statistic exceeds the critical value, we can reject the hypothesis of independence. For example, with 1 degree of freedom, the statistic must exceed 2.706 for us to be 90% sure that the variables are correlated.

Consider the following set of training examples for the unknown target function ⟨X1, X2⟩ → Y. Each row indicates the values observed and how many times that set of values was observed. For example, (+, T, T) was observed 3 times, while (−, T, T) was never observed.

    Y    X1    X2    Count
    +    T     T     3
    +    T     F     4
    +    F     T     4
    +    F     F     1
    −    T     T     0
    −    T     F     1
    −    F     T     3
    −    F     F     5

1. [3 pts] What is the sample entropy H(Y) for this training data?

2. [3 pts] What is the information gain IG(X2) ≡ H(Y) − H(Y | X2) for this sample of training data?

3. [4 pts] Draw the decision tree that would be learned by ID3 (without postpruning) from this sample of training data. Show your information gain calculations.

4. [5 pts] Are the splits in the decision tree from the previous part statistically significant? That is, does the chi-square test with maxPChance = 0.10 support the conclusion that the attributes in the splits are correlated? (Hint: the critical value for the statistic will be 2.706.)
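As a sanity check on the hand calculations in questions 1, 2, and 4, the quantities can be computed directly from the table above. A minimal Python sketch (for checking your work only; it is not part of the required coding submission):

```python
import math
from collections import Counter

# Rows of the table above: (Y, X1, X2, count).
data = [('+', 'T', 'T', 3), ('+', 'T', 'F', 4), ('+', 'F', 'T', 4), ('+', 'F', 'F', 1),
        ('-', 'T', 'T', 0), ('-', 'T', 'F', 1), ('-', 'F', 'T', 3), ('-', 'F', 'F', 5)]
n = sum(count for _, _, _, count in data)

def entropy(counts):
    """Base-2 entropy of the empirical distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Question 1: sample entropy H(Y).
y_counts = Counter()
for y, _, _, c in data:
    y_counts[y] += c
h_y = entropy(y_counts.values())

# Question 2: information gain IG(X2) = H(Y) - H(Y | X2).
h_y_given_x2 = 0.0
for v in ('T', 'F'):
    branch = Counter()
    for y, _, x2, c in data:
        if x2 == v:
            branch[y] += c
    h_y_given_x2 += sum(branch.values()) / n * entropy(branch.values())

# Question 4: chi-square statistic for the split of Y on X2.
x2_counts = Counter()
for _, _, x2, c in data:
    x2_counts[x2] += c
chi2 = 0.0
for y in y_counts:
    for v in x2_counts:
        observed = sum(c for yy, _, x2, c in data if yy == y and x2 == v)
        expected = y_counts[y] * x2_counts[v] / n  # n * p(y) * p(x2)
        chi2 += (observed - expected) ** 2 / expected

print(f"H(Y) = {h_y:.3f}")
print(f"IG(X2) = {h_y - h_y_given_x2:.3f}")
print(f"chi2(Y, X2) = {chi2:.3f}  (compare against the critical value 2.706)")
```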
1.2 KL-Divergence, Information Gain, and Entropy

When we discussed learning decision trees in class, we chose the next attribute to split on by choosing the one with maximum information gain, which was defined in terms of entropy. To further our understanding of information gain, we will explore its connection to KL-divergence, an important concept in information theory and machine learning (a numerical illustration appears after Section 2.1 below). For more on these concepts, refer to Section 1.6 in Bishop.

The KL-divergence from a distribution p(x) to a distribution q(x) can be thought of as a distance measure from P to Q:

    \mathrm{KL}(p \| q) = -\sum_x p(x) \log_2 \frac{q(x)}{p(x)}

From an information theory perspective, the KL-divergence specifies the number of additional bits required on average to transmit values of x if the values are distributed with respect to p(x) but we encode them assuming the distribution q(x). If p(x) = q(x), then KL(p||q) = 0; otherwise KL(p||q) > 0. The smaller the KL-divergence, the more similar the two distributions.

We can define information gain as the KL-divergence from the observed joint distribution of X and Y to the distribution obtained if we assume independence:

    IG(x, y) ≡ \mathrm{KL}(p(x, y) \| p(x)p(y)) = -\sum_x \sum_y p(x, y) \log_2 \frac{p(x) \, p(y)}{p(x, y)}

When the information gain is high, it indicates that adding a split to the decision tree will give a more accurate model.

1. [5 pts] Show that this definition of information gain is equivalent to the one given in class. That is, show that IG(x, y) = H[x] − H[x | y] = H[y] − H[y | x], starting from the definition in terms of KL-divergence.

2 Regression [Brian, 25 points]

2.1 Linear Models

A billionaire near Seattle is trying to estimate the gross sales of different software packages. He has a few different functions that he would like to try to use for his model. Unfortunately, he only has at his disposal a software package for linear regression. The linear regression package takes as input a vector of responses (Y) and a matrix of features (X), where the entry X_{i,j} corresponds to the j-th input for the i-th example and Y_i is the i-th response of the function. The linear regression package returns a vector of weights w that minimizes the sum of squared residual errors; the j-th entry of the vector, w_j, is the weight applied to the j-th feature.

The billionaire has a gross sales amount G_i and a vector of characteristics C_i for each piece of software his company has ever produced. Your job is to help him find the maximum likelihood parameter estimates for the functions he wants to employ, or to tell him when the ML estimate cannot be obtained using his linear regression software. Your answer should:

• specify how the response and inputs (Y_i and X_{i,j}) are calculated for the regression software package, and
• specify how the parameters α can be obtained from the values w returned by the regression software package, so that α is the maximum likelihood estimate,

OR:

• provide your reasoning for why the software cannot be employed.

Example. To help you understand how to use the software, the
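As a sketch of the intended reduction, consider a hypothetical model (not one of the assignment's functions), G_i = α_1 + α_2 sin(C_i) with a scalar characteristic C_i, and let NumPy's lstsq stand in for the billionaire's regression package:

```python
import numpy as np

# Hypothetical model for illustration: G_i = alpha_1 + alpha_2 * sin(C_i).
# It is nonlinear in C_i but linear in the parameters, so the linear
# regression package applies: choose features X_{i,1} = 1, X_{i,2} = sin(C_i)
# and responses Y_i = G_i; the returned weights are then alpha itself.

rng = np.random.default_rng(0)
C = rng.uniform(0.0, 10.0, size=50)                    # characteristics
G = 2.0 + 3.0 * np.sin(C) + rng.normal(0.0, 0.1, 50)   # noisy gross sales

X = np.column_stack([np.ones_like(C), np.sin(C)])      # feature matrix
Y = G                                                  # response vector

# Stand-in for the package: w minimizing the sum of squared residuals.
w, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Under i.i.d. Gaussian noise, least squares is the maximum likelihood
# estimate, so alpha is read off from w directly.
alpha = w
print(alpha)  # close to (2.0, 3.0) for this simulated data
```

The same recipe works for any model that is linear in its parameters; when the parameters enter nonlinearly and cannot be transformed away, the package cannot return the ML estimate directly.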

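Returning to question 1 of Section 1.2: the identity can be checked numerically before proving it. A small sketch (the joint distribution below is invented purely for illustration):

```python
import numpy as np

# A made-up 2x3 joint distribution p(x, y), for illustration only.
p_xy = np.array([[0.10, 0.25, 0.05],
                 [0.20, 0.10, 0.30]])
p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# Information gain via KL(p(x, y) || p(x) p(y)), as defined in Section 1.2.
ig_kl = -(p_xy * np.log2(np.outer(p_x, p_y) / p_xy)).sum()

def H(p):
    """Base-2 entropy of a probability vector."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Information gain via H(Y) - H(Y | X), the definition from class.
h_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))
ig_entropy = H(p_y) - h_y_given_x

print(ig_kl, ig_entropy)  # the two values agree
```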
