Decision Trees
Aarti Singh
Machine Learning 10-701/15-781, Oct 6, 2010

Learning a good prediction rule
- Learn a mapping from features X to label Y; we want the best prediction rule within a hypothesis space (function class).
- Parametric classes: Gaussian, binomial, etc.; conditionally independent class densities (Naive Bayes); linear decision boundary (logistic regression).
- Nonparametric classes: histograms, nearest neighbor, kernel estimators, and decision trees (today).
- Given training data, find a hypothesis (function) in the class that is close to the best prediction rule.

First: What does a decision tree represent? Given a decision tree, how do we assign a label to a test point?

Decision Tree for Tax Fraud Detection
- Example tree: the root tests Refund (Yes -> predict NO; No -> test Marital Status); Single or Divorced -> test Taxable Income (below 80K -> NO, above 80K -> YES); Married -> NO.
- Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? Following the branches (Refund = No, then MarSt = Married) reaches a leaf, so we assign Cheat = No.
- Each internal node tests one feature Xi; each branch from a node selects one value for Xi; each leaf node predicts Y.

Decision Trees more generally
- Features can be discrete, continuous, or categorical.
- Each internal node tests some set of features Xi; each branch from a node selects a set of values for Xi; each leaf node predicts Y.

So far: what a decision tree represents, and how to assign a label to a test point. Now: how do we learn a decision tree from training data, and what is the decision on each leaf?

How to learn a decision tree
- Top-down induction (ID3, C4.5, CART), growing a tree like the tax-fraud example split by split.

Which feature is best to split?
    X1  X2  Y
    T   T   T
    T   F   T
    T   T   T
    T   F   T
    F   T   T
    F   F   F
    F   T   F
    F   F   F
- Split on X1: X1 = T gives Y = 4 Ts, 0 Fs (absolutely sure); X1 = F gives Y = 1 T, 3 Fs (kind of sure).
- Split on X2: X2 = T gives Y = 3 Ts, 1 F (kind of sure); X2 = F gives Y = 2 Ts, 2 Fs (absolutely unsure).
- A split is good if we are more certain about the classification after the split; a uniform distribution of labels is bad.

Which feature is best to split?
- Pick the attribute (feature) which yields maximum information gain: arg max_i IG(Xi) = H(Y) - H(Y | Xi), where H(Y) is the entropy of Y and H(Y | Xi) is the conditional entropy of Y given Xi.

Entropy
- Entropy of a random variable Y: H(Y) = - sum_y P(Y = y) log2 P(Y = y).
- More uncertainty means more entropy: for Y ~ Bernoulli(p), entropy is maximal at the uniform case p = 1/2 and zero when Y is deterministic (p = 0 or 1).
- Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y under the most efficient code.
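As a concrete check of these definitions, here is a minimal Python sketch (not from the lecture; the helper names entropy, conditional_entropy, and information_gain are illustrative) that computes the entropy of Y and the information gain of X1 and X2 for the eight-example table above:

```python
import math

def entropy(labels):
    """Empirical entropy H(Y) = -sum_y P(Y=y) log2 P(Y=y), in bits."""
    n = len(labels)
    probs = [labels.count(v) / n for v in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(feature, labels):
    """H(Y|X) = sum_x P(X=x) H(Y|X=x): entropy after the split,
    with each branch weighted by the probability of following it."""
    n = len(labels)
    h = 0.0
    for x in set(feature):
        branch = [y for f, y in zip(feature, labels) if f == x]
        h += (len(branch) / n) * entropy(branch)
    return h

def information_gain(feature, labels):
    """IG(X) = H(Y) - H(Y|X): reduction in uncertainty about Y from splitting on X."""
    return entropy(labels) - conditional_entropy(feature, labels)

# The eight-example table from the split slides (columns X1, X2, Y).
X1 = list('TTTTFFFF')
X2 = list('TFTFTFTF')
Y  = list('TTTTTFFF')

print(f"H(Y)   = {entropy(Y):.3f}")               # about 0.954 bits
print(f"IG(X1) = {information_gain(X1, Y):.3f}")  # about 0.549 -- the better split
print(f"IG(X2) = {information_gain(X2, Y):.3f}")  # about 0.049
```

Splitting on X1 therefore maximizes the information gain, matching the intuition that X1 = T makes us absolutely sure about Y.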
Andrew Moore's "Entropy in a Nutshell"
- Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
- High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout the dining room.

Information Gain
- The advantage of an attribute is the decrease in uncertainty.
- Entropy of Y before the split: H(Y). Entropy of Y after splitting on Xi: H(Y | Xi) = sum_x P(Xi = x) H(Y | Xi = x), weighting each branch by the probability of following it.
- Information gain is the difference: IG(Xi) = H(Y) - H(Y | Xi); maximum information gain corresponds to minimum conditional entropy.
- On the eight-example table above, splitting on X1 (4 Ts / 0 Fs versus 1 T / 3 Fs) leaves less uncertainty, and hence gives a larger information gain, than splitting on X2 (3 Ts / 1 F versus 2 Ts / 2 Fs).

Which feature is best to split?
- Pick the attribute (feature) which yields maximum information gain: the feature which yields the maximum reduction in entropy provides the maximum information about Y.

Expressiveness of Decision Trees
- Decision trees can express any function of the input features; e.g., for Boolean functions, each truth-table row corresponds to a path to a leaf.
- There is a decision tree which perfectly classifies any training set, with one path to a leaf for each example, but it won't generalize well to new examples, so prefer to find more compact decision trees.

Decision Trees: Overfitting
- One training example per leaf overfits; we need a compact, pruned decision tree.

Bias-Variance Tradeoff
- A coarse partition: large bias, small variance. A fine partition: small bias, large variance (classifiers learned from different training sets vary widely around the average classifier and away from the ideal classifier).

When to Stop?
Many strategies for picking simpler trees:
- Pre-pruning: fixed depth, or fixed number of leaves.
- Post-pruning: chi-square test. Convert the decision tree to a set of rules, eliminate variable-value tests in rules which are independent of the label (using a chi-square test for independence), then simplify the rule set by eliminating unnecessary rules.
- Information criteria: MDL (Minimum Description Length).

Information Criteria
- Penalize complex models by introducing a cost term: trade off the log-likelihood (data fit, in regression or classification) against a cost that penalizes trees with more leaves.

Information Criteria: MDL
- Penalize complex models based on their information content: MDL (Minimum Description Length) scores f by the number of bits needed to describe it (its description length).
- Example: a binary decision tree with k leaves has 2k - 1 nodes, so 2k - 1 bits encode the tree structure and k bits encode the 0/1 label of each leaf; e.g., a tree with 5 leaves needs 9 bits to encode its structure.

So far: what a decision tree represents, how to assign a label to a test point, and how to learn the tree from training data. Remaining: what is the decision on each leaf?

How to assign a label to each leaf
- Classification: majority vote.
- Regression: fit a constant, linear, or polynomial model.

Regression trees
- Example: split on Num Children (>= 2 versus < 2) and fit a constant, the average of the training responses, at each leaf.

Connection between nearest-neighbor / histogram classifiers and decision trees
- Local prediction.
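The slides describe the full learning procedure in pieces: top-down induction (ID3 / C4.5 / CART), choosing each split by information gain, stopping early (here via a fixed maximum depth, one of the pre-pruning options), and assigning each leaf the majority-vote label. The sketch below (an illustrative toy implementation, not the lecture's code; grow_tree, predict, and the dictionary tree representation are assumptions) puts those pieces together and runs on the eight-example table used earlier:

```python
from collections import Counter
import math

def entropy(labels):
    """Empirical entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """IG = H(Y) - H(Y | feature), branches weighted by how many rows follow them."""
    n = len(labels)
    h_cond = 0.0
    for value in set(r[feature] for r in rows):
        branch = [y for r, y in zip(rows, labels) if r[feature] == value]
        h_cond += (len(branch) / n) * entropy(branch)
    return entropy(labels) - h_cond

def grow_tree(rows, labels, features, max_depth=3):
    """Top-down induction: greedily split on the highest-gain feature; stop at a
    pure node, when no features or depth remain; leaves predict by majority vote."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features or max_depth <= 0:
        return majority                                   # leaf
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 0:
        return majority                                   # no useful split left
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = grow_tree([rows[i] for i in idx],
                                        [labels[i] for i in idx],
                                        [f for f in features if f != best],
                                        max_depth - 1)
    return tree

def predict(tree, row, default=None):
    """Follow matching branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        for (feature, value), subtree in tree.items():
            if row[feature] == value:
                tree = subtree
                break
        else:
            return default        # feature value never seen during training
    return tree

rows = [{'X1': a, 'X2': b} for a, b in zip('TTTTFFFF', 'TFTFTFTF')]
labels = list('TTTTTFFF')
tree = grow_tree(rows, labels, ['X1', 'X2'])
print(tree)                                   # splits on X1 first, as expected
print(predict(tree, {'X1': 'F', 'X2': 'F'}))  # -> 'F'
```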
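The regression-tree slide says to fit a constant, the average of the training responses, at each leaf. A one-split sketch of that idea (the numbers below are hypothetical, made up to mirror the Num Children split on the slide):

```python
import statistics

def fit_stump(xs, ys, threshold):
    """One-split regression tree: partition on x < threshold versus x >= threshold
    and fit a constant (the mean of the training targets) in each leaf."""
    left  = [y for x, y in zip(xs, ys) if x <  threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    return {'threshold': threshold,
            'left':  statistics.mean(left),
            'right': statistics.mean(right)}

def predict(stump, x):
    """Local prediction: every point in a leaf's region gets that leaf's constant."""
    return stump['left'] if x < stump['threshold'] else stump['right']

# Hypothetical training data: a response measured against Num Children, split at 2.
num_children = [0, 1, 1, 2, 3, 4]
response     = [300, 280, 310, 150, 120, 100]
stump = fit_stump(num_children, response, threshold=2)
print(stump)              # each leaf stores the average of its training responses
print(predict(stump, 3))  # falls in the right leaf -> about 123.3
```

Like a histogram or nearest-neighbor estimator, the tree predicts locally: each leaf's constant is computed only from the training points that fall in that leaf's region.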