Decision Trees
Machine Learning – 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
February 5th, 2007
©2005-2007 Carlos Guestrin

Linear separability
- A dataset is linearly separable iff there exists a separating hyperplane, i.e., ∃ w such that:
  - w0 + Σi wi xi > 0 if x = {x1, ..., xn} is a positive example
  - w0 + Σi wi xi < 0 if x = {x1, ..., xn} is a negative example

Not linearly separable data
- Some datasets are not linearly separable!

Addressing non-linearly separable data – Option 1, non-linear features
- Choose non-linear features, e.g.,
  - Typical linear features: w0 + Σi wi xi
  - Example of non-linear features: degree-2 polynomials, w0 + Σi wi xi + Σij wij xi xj
- The classifier hw(x) is still linear in the parameters w
  - As easy to learn
  - Data is linearly separable in higher-dimensional spaces
  - More discussion later this semester

Addressing non-linearly separable data – Option 2, non-linear classifier
- Choose a classifier hw(x) that is non-linear in the parameters w, e.g., decision trees, neural networks, nearest neighbor, ...
- More general than linear classifiers
- But can often be harder to learn (non-convex/concave optimization required)
- But, but, often very useful
- (BTW, later this semester we'll see that these options are not that different)

A small dataset: Miles Per Gallon
- From the UCI repository (thanks to Ross Quinlan)
- 40 records; suppose we want to predict MPG

  mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
  good  4          low           low         low     high          75to78     asia
  bad   6          medium        medium      medium  medium        70to74     america
  bad   4          medium        medium      medium  low           75to78     europe
  bad   8          high          high        high    low           70to74     america
  bad   6          medium        medium      medium  medium        70to74     america
  bad   4          low           medium      low     medium        70to74     asia
  bad   4          low           medium      low     low           70to74     asia
  bad   8          high          high        high    low           75to78     america
  :     :          :             :           :       :             :          :
  bad   8          high          high        high    low           70to74     america
  good  8          high          medium      high    high          79to83     america
  bad   8          high          high        high    low           75to78     america
  good  4          low           low         low     low           79to83     america
  bad   6          medium        medium      medium  high          75to78     america
  good  4          medium        low         low     low           79to83     america
  good  4          low           low         medium  high          79to83     america
  bad   8          high          high        high    low           70to74     america
  good  4          low           medium      low     medium        75to78     europe
  bad   5          medium        medium      medium  medium        75to78     europe

A Decision Stump
[Figure: a one-level tree (decision stump) splitting the MPG records on cylinders]

Recursion Step
- Take the original dataset...
- ...and partition it according to the value of the attribute we split on:
  records in which cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8
- Then build a tree from each of these groups of records.

Second level of tree
- Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia
- (Similar recursion in the other cases)

The final tree
[Figure: the fully grown decision tree for the MPG data]

Classification of a new example
- Classifying a test example: traverse the tree and report the leaf label.
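As a concrete illustration of this traversal (not from the lecture slides), here is a minimal Python sketch. It assumes a simple nested-dictionary tree representation; the classify function and the small MPG-style tree below are hypothetical, chosen only to show how a test example follows one branch per internal node until it reaches a leaf.

    def classify(tree, example):
        """Traverse the tree and report the leaf label for `example` (a dict)."""
        # A leaf is anything that is not an internal {"attribute", "branches"} node.
        while isinstance(tree, dict):
            attribute = tree["attribute"]      # attribute tested at this node
            value = example[attribute]         # the example's value for that attribute
            tree = tree["branches"][value]     # follow the matching branch
        return tree                            # leaf label, e.g. "good" / "bad"

    # Hypothetical tree, loosely shaped like the MPG example: split on cylinders,
    # then (for 4-cylinder cars) on maker.  Structure and labels are illustrative only.
    mpg_tree = {
        "attribute": "cylinders",
        "branches": {
            4: {"attribute": "maker",
                "branches": {"asia": "good", "america": "bad", "europe": "good"}},
            5: "bad",
            6: "bad",
            8: "bad",
        },
    }

    print(classify(mpg_tree, {"cylinders": 4, "maker": "asia"}))     # -> good
    print(classify(mpg_tree, {"cylinders": 8, "maker": "america"}))  # -> bad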
Are all decision trees equal?
- Many trees can represent the same concept
- But not all trees will have the same size!
- e.g., φ = (A ∧ B) ∨ (¬A ∧ C), i.e., (A and B) or (not A and C)

Learning decision trees is hard!!!
- Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
- Resort to a greedy heuristic:
  - Start from empty decision tree
  - Split on next best attribute (feature)
  - Recurse

Choosing a good attribute
[Table: example records with binary attributes X1, X2 and label Y]

Measuring uncertainty
- Good split if we are more certain about the classification after the split
  - Deterministic good (all true or all false)
  - Uniform distribution bad
- Two example distributions over Y ∈ {A, B, C, D}: the uniform P(Y=A) = P(Y=B) = P(Y=C) = P(Y=D) = 1/4 (maximally uncertain) versus the skewed P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8 (less uncertain)

Entropy
- Entropy H(Y) of a random variable Y: H(Y) = -Σi P(Y = yi) log2 P(Y = yi)
- More uncertainty, more entropy!
- Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)

Andrew Moore's Entropy in a nutshell
- Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl
- High entropy: the values (locations of soup) are unpredictable... almost uniformly sampled throughout our dining room

Information gain
- Advantage of an attribute: the decrease in uncertainty
  - Entropy of Y before you split
  - Entropy after the split, weighting each branch by the probability of following it (i.e., the normalized number of records)
- Information gain is the difference
[Table: the X1, X2, Y examples again]

Learning decision trees
- Start from empty decision tree
- Split on next best attribute (feature)
  - Use, for example, information gain to select the attribute: split on the attribute with the largest gain
- Recurse

Look at all the information gains...
- Suppose we want to predict MPG
[Figure: information gain of each attribute for predicting MPG]

A Decision Stump
[Figure: the resulting decision stump, splitting on the highest-gain attribute]

Base Case One
- Don't split a node if all matching records have the same output value

Base Case Two
- Don't split a node if none of the attributes can create multiple non-empty children

Base Case Two: No attributes can distinguish
[Figure: example records that no attribute can distinguish]

Base Cases
- Base Case One: if all records in the current data subset have the same output, then don't recurse
- Base Case Two: if all records have exactly the same set of input attributes, then don't recurse

Base Cases: An idea
- Base Cases One and Two as above, plus a proposed Base Case 3: if all attributes have zero information gain, then don't recurse
- Is this a good idea?

The problem with Base Case 3
- y = a XOR b

  a  b  y
  0  0  0
  0  1  1
  1  0  1
  1  1  0

- The information gains: [Figure: the information gain of a and of b are both zero]
- The resulting decision tree: [Figure]
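The entropy, information gain, and greedy recursion described above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions (records as dictionaries of attribute values, labels in a parallel list; the function and variable names are not from the lecture). It implements Base Cases One and Two but deliberately omits the proposed Base Case 3, and it returns trees in the same nested-dictionary form as the classification sketch earlier in these notes.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """H(Y) = -sum_y P(Y=y) log2 P(Y=y), estimated from a list of labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """IG(X) = H(Y) - H(Y | X): entropy before the split minus entropy after,
        weighting each branch by its fraction of the records."""
        n = len(labels)
        after = 0.0
        for v in set(ex[attribute] for ex in examples):
            branch = [y for ex, y in zip(examples, labels) if ex[attribute] == v]
            after += (len(branch) / n) * entropy(branch)
        return entropy(labels) - after

    def build_tree(examples, labels, attributes):
        """Greedy recursive learner: split on the best attribute, then recurse."""
        # Base Case One: all matching records have the same output value.
        if len(set(labels)) == 1:
            return labels[0]
        # Base Case Two: no attribute can create multiple non-empty children.
        if not any(len(set(ex[a] for ex in examples)) > 1 for a in attributes):
            return Counter(labels).most_common(1)[0][0]
        # Greedy step: split on the attribute with the highest information gain.
        best = max(attributes, key=lambda a: information_gain(examples, labels, a))
        branches = {}
        for v in set(ex[best] for ex in examples):
            sub = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == v]
            sub_ex, sub_y = zip(*sub)
            branches[v] = build_tree(list(sub_ex), list(sub_y), attributes)
        return {"attribute": best, "branches": branches}

    # Quick check against the "Measuring uncertainty" slide:
    print(entropy(["A", "B", "C", "D"]))                       # uniform: 2.0 bits
    print(entropy(["A", "A", "A", "A", "B", "B", "C", "D"]))   # skewed: 1.75 bits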
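Continuing the sketch above, we can check the point of this last slide numerically: for y = a XOR b, neither attribute alone reduces the uncertainty in y, so the proposed Base Case 3 would stop at the root, while the greedy learner that keeps splitting still recovers a perfect tree. (This snippet reuses information_gain and build_tree from the sketch above.)

    # Usage example on the XOR dataset from the last slide.
    xor_examples = [{"a": 0, "b": 0}, {"a": 0, "b": 1},
                    {"a": 1, "b": 0}, {"a": 1, "b": 1}]
    xor_labels = [0, 1, 1, 0]   # y = a XOR b

    # Both single-attribute information gains are zero ...
    for attr in ("a", "b"):
        print(attr, information_gain(xor_examples, xor_labels, attr))  # -> 0.0

    # ... so Proposed Base Case 3 would refuse to split at the root.  The learner
    # without that base case still splits, and the resulting two-level tree
    # classifies every record correctly.
    print(build_tree(xor_examples, xor_labels, ["a", "b"]))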