Decision Trees, cont.
Boosting
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
October 1st, 2007
© Carlos Guestrin 2005-2007

A Decision Stump
(figure)

The final tree
(figure)

Basic Decision Tree Building Summarized

BuildTree(DataSet, Output)
- If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
- If all input values are the same, return a leaf node that says "predict the majority output".
- Else find the attribute X with the highest Info Gain.
- Suppose X has nX distinct values (i.e., X has arity nX):
  - Create and return a non-leaf node with nX children.
  - The i'th child is built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = i'th distinct value of X.

MPG Test Set Error

The test set error is much worse than the training set error... why?

Decision Trees & Learning Bias

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
...
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Decision trees will overfit
- Standard decision trees have no learning bias
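The BuildTree procedure summarized earlier can be sketched as a short runnable Python function. This is a minimal illustration under my own assumptions (records as dicts of categorical attributes; the names build_tree, info_gain, entropy are mine), not the lecture's code:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of output labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    """Information gain of splitting on categorical attribute attr."""
    n = len(labels)
    groups = {}
    for rec, y in zip(records, labels):
        groups.setdefault(rec[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

def build_tree(records, labels, attrs):
    """records: list of dicts; labels: outputs; attrs: candidate attribute names."""
    if len(set(labels)) == 1:
        return labels[0]                         # all outputs identical: leaf
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs or all(len({r[a] for r in records}) == 1 for a in attrs):
        return majority                          # all inputs identical: majority leaf
    best = max(attrs, key=lambda a: info_gain(records, labels, a))
    node = {'attr': best, 'children': {}}
    for v in {r[best] for r in records}:         # one child per distinct value (arity nX)
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == v]
        rs, ys = zip(*sub)
        node['children'][v] = build_tree(list(rs), list(ys),
                                         [a for a in attrs if a != best])
    return node
```

Leaves are plain label strings and internal nodes are dicts, so the returned structure can be walked directly to classify a new record.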
- Training set error is always zero! (if there is no label noise)
- Lots of variance
- Will definitely overfit!!!
- Must bias towards simpler trees
- Many strategies for picking simpler trees:
  - Fixed depth
  - Fixed number of leaves
  - Or something smarter...

Consider this split
(figure)

A chi-square test
- Suppose that mpg was completely uncorrelated with maker.
- What is the chance we'd have seen data with at least this apparent level of association anyway?
- By using a particular kind of chi-square test, the answer is 7.2%.
- (Such simple hypothesis tests are very easy to compute; unfortunately there is not enough time to cover them in the lecture, but in your homework, you'll have fun! :))

Using Chi-squared to avoid overfitting
- Build the full decision tree as before.
- But when you can grow it no more, start to prune:
  - Beginning at the bottom of the tree, delete splits in which pchance > MaxPchance.
  - Continue working your way up until there are no more prunable nodes.
- MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise.

Pruning example
- With MaxPchance = 0.1, you will see the following MPG decision tree (figure).
- Note the improved test set accuracy compared with the unpruned tree.

MaxPchance
- Technical note: MaxPchance is a regularization parameter that helps us bias towards simpler models.
- (Plot: expected test set error vs. MaxPchance; decreasing MaxPchance gives high bias, increasing it gives high variance.)
- We'll learn how to choose the value of these magic parameters soon!

Real-Valued Inputs

What should we do if some of the inputs are real-valued?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
...
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe

- Infinite number of possible split values!
- Finite dataset, so only a finite number of relevant splits!
- Idea One: Branch on each possible real value.

"One branch for each numeric value" idea:
- Hopeless: with such a high branching factor, we will shatter the dataset and overfit.

Threshold splits
- Binary tree, split on attribute X:
  - One branch: X < t
  - Other branch: X >= t

Choosing a threshold split
- Search through possible values of t: seems hard!!!
- But only a finite number of t's are important:
  - Sort the data according to X into {x1, ..., xm}.
  - Consider split points of the form xi + (xi+1 - xi)/2.

A better idea: thresholded splits
- Suppose X is real-valued.
- Define IG(Y|X:t) as H(Y) - H(Y|X:t)
- Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
- IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t.
- Then define IG*(Y|X) = max_t IG(Y|X:t)
- For each real-valued attribute, use IG*(Y|X) to assess its suitability as a split.
- Note: we may split on an attribute multiple times, with different thresholds.

Example with MPG
(figure)

Example tree using reals
(figure)

What you need to know about decision trees
- Decision trees are one of the most popular data mining tools:
  - Easy to understand
  - Easy to implement
  - Easy to use
  - Computationally cheap (to solve heuristically)
- Information gain to select attributes (ID3, C4.5, ...)
- Presented for classification; can be used for regression
  and density estimation too.
- Decision trees will overfit!!!
  - Zero bias classifier, so lots of variance.
  - Must use tricks to find "simple trees", e.g.:
    - Fixed depth / early stopping
    - Pruning
    - Hypothesis testing

Acknowledgements
Some of the material
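As a worked illustration of the thresholded splits described in this lecture, here is a minimal Python sketch of IG*(Y|X) = max_t IG(Y|X:t), scanning the midpoints xi + (xi+1 - xi)/2 of the sorted values. The function name and data layout are my own assumptions, not the lecture's code:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of output labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold_ig(xs, ys):
    """Return (IG*(Y|X), best t): max over midpoint thresholds of H(Y) - H(Y|X:t)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    h_y = entropy(ys)
    best_ig, best_t = 0.0, None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # no split between equal x values
        t = pairs[i][0] + (pairs[i + 1][0] - pairs[i][0]) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        # H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
        h_cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        ig = h_y - h_cond
        if ig > best_ig:
            best_ig, best_t = ig, t
    return best_ig, best_t
```

Because only midpoints between consecutive sorted values are tried, the search is over at most m - 1 candidate thresholds, which is the "only a finite number of t's are important" observation above.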