Decision Trees
Machine Learning 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
February 5th, 2007
©2005-2007 Carlos Guestrin

Linear separability
- A dataset is linearly separable iff there exists a separating hyperplane, i.e., ∃ w such that:
  - w_0 + Σ_i w_i x_i > 0 if x = (x_1, ..., x_n) is a positive example
  - w_0 + Σ_i w_i x_i < 0 if x = (x_1, ..., x_n) is a negative example

Not linearly separable data
- Some datasets are not linearly separable

Addressing non-linearly separable data - Option 1: non-linear features
- Choose non-linear features, e.g.:
  - Typical linear features: w_0 + Σ_i w_i x_i
  - Example of non-linear features: degree-2 polynomials, w_0 + Σ_i w_i x_i + Σ_{i,j} w_{ij} x_i x_j
- The classifier h_w(x) is still linear in the parameters w
  - As easy to learn
  - Data is linearly separable in the higher-dimensional feature space
  - More discussion later this semester
  - (A small code sketch of this option appears after the "Learning decision trees is hard" slide below)

Addressing non-linearly separable data - Option 2: non-linear classifier
- Choose a classifier h_w(x) that is non-linear in the parameters w, e.g., decision trees, neural networks, nearest neighbor, ...
- More general than linear classifiers
- But they can often be harder to learn (non-convex/concave optimization required)
- But, but: often very useful
- BTW: later this semester, we'll see that these options are not that different

A small dataset: Miles Per Gallon (mpg)
Suppose we want to predict MPG (good or bad):

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

40 Records. From the UCI repository (thanks to Ross Quinlan).

A Decision Stump

Recursion Step
- Take the original dataset and partition it according to the value of the attribute we split on:
  - Records in which cylinders = 4
  - Records in which cylinders = 5
  - Records in which cylinders = 6
  - Records in which cylinders = 8

Recursion Step
- Build tree from these records: records in which cylinders = 4
- Build tree from these records: records in which cylinders = 5
- Build tree from these records: records in which cylinders = 6
- Build tree from these records: records in which cylinders = 8

Second level of tree
- Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia
- Similar recursion in the other cases

The final tree

Classification of a new example
- Classifying a test example: traverse the tree and report the leaf label

Are all decision trees equal?
- Many trees can represent the same concept
- But not all trees will have the same size
- e.g., y = (A ∧ B) ∨ (¬A ∧ C), i.e., "(A and B) or (not A and C)"

Learning decision trees is hard
- Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
- Resort to a greedy heuristic:
  - Start from an empty decision tree
  - Split on the next best attribute (feature)
  - Recurse
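Referring back to the "Option 1: non-linear features" slide above, the following is a minimal Python sketch (not the lecture's code) of a degree-2 polynomial feature expansion with a decision rule that stays linear in the parameters w. The one-dimensional example and the hand-picked weights are purely illustrative.

```python
from itertools import combinations_with_replacement

def degree2_features(x):
    """Map (x_1, ..., x_n) to (x_1, ..., x_n, x_i * x_j for all i <= j)."""
    feats = list(x)
    feats += [x[i] * x[j]
              for i, j in combinations_with_replacement(range(len(x)), 2)]
    return feats

def linear_classify(w0, w, feats):
    """Predict positive iff w_0 + sum_i w_i * feat_i > 0."""
    return w0 + sum(wi * fi for wi, fi in zip(w, feats)) > 0

# Example: in one dimension, "positive iff |x| > 1" is not linearly separable,
# but with the extra feature x^2 it is (score = -1 + 0*x + 1*x^2).
# The weights here are hand-picked for illustration, not learned.
for x in [(-2.0,), (-0.5,), (0.5,), (2.0,)]:
    feats = degree2_features(x)                          # (x, x^2)
    print(x, linear_classify(-1.0, [0.0, 1.0], feats))   # True, False, False, True
```

Because the expanded classifier is still linear in w, it can be trained with the same machinery as an ordinary linear classifier, which is the "as easy to learn" point on that slide.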
Choosing a good attribute

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Measuring uncertainty
- Good split if we are more certain about the classification after the split
  - Deterministic: good (all true or all false)
  - Uniform distribution: bad
- For example:
  - P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8
  - P(Y=A) = 1/4, P(Y=B) = 1/4, P(Y=C) = 1/4, P(Y=D) = 1/4

Entropy
- Entropy H(Y) of a random variable Y:
  H(Y) = -Σ_i P(Y = y_i) log_2 P(Y = y_i)
- More uncertainty, more entropy
- Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)

Andrew Moore's Entropy in a nutshell
- Low entropy: the values (locations of soup) sampled entirely from within the soup bowl
- High entropy: the values (locations of soup) unpredictable, almost uniformly sampled throughout our dining room

Information gain
- Advantage of an attribute = decrease in uncertainty:
  - Entropy of Y before the split
  - Entropy after the split: weight the entropy of each branch by the probability of following that branch (i.e., the normalized number of records)
- Information gain is the difference: IG(X) = H(Y) - H(Y | X)
- (Illustrated on the X1, X2, Y dataset above)

Learning decision trees
- Start from an empty decision tree
- Split on the next best attribute (feature): use, for example, information gain to select the attribute to split on
- Recurse
- (A code sketch putting these steps together appears at the end of these notes)

Suppose we want to predict MPG
- Look at all the information gains

A Decision Stump

Base Case One
- Don't split a node if all matching records have the same output value

Base Case Two
- Don't split a node if none of the attributes can create multiple non-empty children
- (No attributes can distinguish the records)

Base Cases
- Base Case One: if all records in the current data subset have the same output, then don't recurse
- Base Case Two: if all records have exactly the same set of input attributes, then don't recurse

Base Cases: an idea
- Base Case One: if all records in the current data subset have the same output, then don't recurse
- Base Case Two: if all records have exactly the same set of input attributes, then don't recurse
- Proposed Base Case 3: if all attributes have zero information gain, then don't recurse
- Is this a good idea?

The problem with Base Case 3
y = a XOR b:

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

- The information gains: both a and b have zero information gain at the root
- The resulting decision tree: a single root node that never splits

If we omit Base Case 3
y = a XOR b:

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

- The resulting decision tree: splits on a and then on b, classifying every record correctly

Basic Decision …
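To tie the pieces together, here is a minimal Python sketch (not the lecture's code) of the greedy learner described above: entropy, information gain, Base Cases One and Two, greedy splitting, and classification by traversing the tree. The record/attribute representation (dicts keyed by attribute name) is an illustrative choice. The final lines run it on the y = a XOR b example: without Proposed Base Case 3, the recursion still recovers a tree that classifies the XOR data perfectly, even though both attributes have zero information gain at the root.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i P(Y=y_i) * log2 P(Y=y_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attr):
    """IG(X) = H(Y) - sum_x P(X=x) * H(Y | X=x)."""
    n = len(labels)
    cond = 0.0
    for value in set(r[attr] for r in records):
        subset = [y for r, y in zip(records, labels) if r[attr] == value]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

def learn_tree(records, labels, attrs):
    # Base Case One: all matching records have the same output value.
    if len(set(labels)) == 1:
        return labels[0]
    # Base Case Two: no attribute can create multiple non-empty children.
    splittable = [a for a in attrs if len(set(r[a] for r in records)) > 1]
    if not splittable:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(splittable, key=lambda a: information_gain(records, labels, a))
    children = {}
    for value in set(r[best] for r in records):
        idx = [i for i, r in enumerate(records) if r[best] == value]
        children[value] = learn_tree([records[i] for i in idx],
                                     [labels[i] for i in idx], attrs)
    return {"attr": best, "children": children}

def classify(tree, record):
    """Traverse the tree and report the leaf label."""
    while isinstance(tree, dict):
        tree = tree["children"][record[tree["attr"]]]
    return tree

# y = a XOR b: both attributes have zero information gain at the root,
# yet recursing anyway yields a tree that classifies the data perfectly.
records = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [0, 1, 1, 0]
tree = learn_tree(records, labels, ["a", "b"])
print([classify(tree, r) for r in records])   # -> [0, 1, 1, 0]
```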