Decision Trees many possible refs e g Mitchell Chapter 3 Boosting Linked from class website Schapire 01 Decision Trees Boosting Machine Learning 10701 15781 Carlos Guestrin Carnegie Mellon University February 6th 2006 2006 Carlos Guestrin 1 Announcements Recitations stay on Thursdays 5 6 30pm in Wean 5409 This week Decision Trees and Boosting Pittsburgh won the Super Bowl 2006 Carlos Guestrin 2 Linear separability A dataset is linearly separable iff a separating hyperplane w such that w0 i wi xi 0 if x x1 xn is a positive example w0 i wi xi 0 if x x1 xn is a negative example 2006 Carlos Guestrin 3 Not linearly separable data Some datasets are not linearly separable 2006 Carlos Guestrin 4 Addressing non linearly separable data Option 1 non linear features Choose non linear features e g Typical linear features w0 i wi xi Example of non linear features Degree 2 polynomials w0 i wi xi ij wij xi xj Classifier hw x still linear in parameters w Usually easy to learn closed form or convex concave optimization Data is linearly separable in higher dimensional spaces More discussion later this semester 2006 Carlos Guestrin 5 Addressing non linearly separable data Option 2 non linear classifier Choose a classifier hw x that is non linear in parameters w e g Decision trees neural networks nearest neighbor More general than linear classifiers But can often be harder to learn non convex concave optimization required But but often very useful BTW Later this semester we ll see that these options are not that different 2006 Carlos Guestrin 6 A small dataset Miles Per Gallon mpg Suppose we want to predict MPG good bad bad bad bad bad bad bad bad good bad good bad good good bad good bad cylinders displacement horsepower 4 6 4 8 6 4 4 8 8 8 8 4 6 4 4 8 4 5 low medium medium high medium low low high high high high low medium medium low high low medium low medium medium high medium medium medium high high medium high low medium low low high medium medium weight acceleration modelyear maker low medium medium high medium low low high high high high low medium low medium high low medium high medium low low medium medium low low low high low low high low high low medium medium 75to78 70to74 75to78 70to74 70to74 70to74 70to74 75to78 70to74 79to83 75to78 79to83 75to78 79to83 79to83 70to74 75to78 75to78 asia america europe america america asia asia america america america america america america america america america europe europe 40 Records From the UCI repository thanks to Ross Quinlan 2006 Carlos Guestrin 7 A Decision Stump 2006 Carlos Guestrin 8 Recursion Step Records in which cylinders 4 Take the Original Dataset And partition it according to the value of the attribute we split on Records in which cylinders 5 Records in which cylinders 6 Records in which cylinders 8 2006 Carlos Guestrin 9 Recursion Step Build tree from These records Records in which cylinders 4 Build tree from These records Records in which cylinders 5 Build tree from These records Records in which cylinders 6 2006 Carlos Guestrin Build tree from These records Records in which cylinders 8 10 Second level of tree Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia 2006 Carlos Guestrin Similar recursion in the other cases 11 The final tree 2006 Carlos Guestrin 12 Classification of a new example Classifying a test example traverse tree and report leaf label 2006 Carlos Guestrin 13 Are all decision trees equal Many trees can represent the same concept But not all trees will have the same size e g A B A C A and B or not A and C 2006 Carlos Guestrin 14 Learning decision trees is hard Learning the simplest smallest decision tree is an NP complete problem Hyafil Rivest 76 Resort to a greedy heuristic Start from empty decision tree Split on next best attribute feature Recurse 2006 Carlos Guestrin 15 Choosing a good attribute 2006 Carlos Guestrin X1 T T T T F F F F X2 T F T F T F T F Y T T T T T F F F 16 Measuring uncertainty Good split if we are more certain about classification after split Deterministic good all true or all false Uniform distribution bad P Y A 1 2 P Y B 1 4 P Y C 1 8 P Y D 1 8 P Y A 1 4 P Y B 1 4 P Y C 1 4 P Y D 1 4 2006 Carlos Guestrin 17 Entropy Entropy H X of a random variable Y More uncertainty more entropy Information Theory interpretation H Y is the expected number of bits needed to encode a randomly drawn value of Y under most efficient code 2006 Carlos Guestrin 18 Andrew Moore s Entropy in a nutshell Low Entropy High Entropy 2006 Carlos Guestrin 19 Andrew Moore s Entropy in a nutshell Low Entropy High Entropy the values locations of soup sampled entirely from within the soup bowl 2006 Carlos Guestrin the values locations of soup unpredictable almost uniformly sampled throughout our dining room 20 X1 T T Advantage of attribute decrease in uncertainty T Entropy of Y before you split T F Entropy after split F Weight by probability of following each branch i e Information gain X2 T F T F T F Y T T T T T F normalized number of records Information gain is difference 2006 Carlos Guestrin 21 Learning decision trees Start from empty decision tree Split on next best attribute feature Use for example information gain to select attribute Split on Recurse 2006 Carlos Guestrin 22 Information Gain Example 2006 Carlos Guestrin 23 Suppose we want to predict MPG Look at all the information gains 2006 Carlos Guestrin 24 A Decision Stump 2006 Carlos Guestrin 25 Base Case One Don t split a node if all matching records have the same output value 2006 Carlos Guestrin 26 Base Case Two Don t split a node if none of the attributes can create multiple nonempty children 2006 Carlos Guestrin 27 Base Case Two No attributes can distinguish 2006 Carlos Guestrin 28 Base Cases Base Case One If all records in current data subset have the same output then don t recurse Base Case Two If all records have exactly the same set of input attributes then don t recurse 2006 Carlos Guestrin 29 Base Cases An idea Base Case One If all records in current data subset have the same output then don t recurse Base Case Two If all records have exactly the same set of input attributes then don t recurse Proposed Base Case 3 If all attributes have zero information gain then don t recurse Is this a good idea 2006 Carlos Guestrin 30 The problem with Base Case 3 a b 0 0 1 1 y 0 1 0 1 0 1 1 0 y a XOR b The information gains The resulting decision tree 2006 Carlos Guestrin 31 If we omit Base Case 3 a b 0 0 …
View Full Document