Decision Tree
Rong Jin

Determine Mileage Per Gallon

  mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
  good  4          low           low         low     high          75to78     asia
  bad   6          medium        medium      medium  medium        70to74     america
  bad   4          medium        medium      medium  low           75to78     europe
  bad   8          high          high        high    low           70to74     america
  bad   6          medium        medium      medium  medium        70to74     america
  bad   4          low           medium      low     medium        70to74     asia
  bad   4          low           medium      low     low           70to74     asia
  bad   8          high          high        high    low           75to78     america
  :     :          :             :           :       :             :          :
  bad   8          high          high        high    low           70to74     america
  good  8          high          medium      high    high          79to83     america
  bad   8          high          high        high    low           75to78     america
  good  4          low           low         low     low           79to83     america
  bad   6          medium        medium      medium  high          75to78     america
  good  4          medium        low         low     low           79to83     america
  good  4          low           low         medium  high          79to83     america
  bad   8          high          high        high    low           70to74     america
  good  4          low           medium      low     medium        75to78     europe
  bad   5          medium        medium      medium  medium        75to78     europe

A Decision Tree for Determining MPG
(From slides of Andrew Moore)
[Figure: the learned decision tree, which classifies the following record as "good"]
  Example record: cylinders = 4, displacement = low, horsepower = low, weight = low, acceleration = high, modelyear = 75to78, maker = asia  ->  mpg = good

Decision Tree Learning
- An extremely popular method
  - Credit risk assessment
  - Medical diagnosis
  - Market analysis
- Good at dealing with symbolic features
- Easy to comprehend, compared to logistic regression models and support vector machines
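To make the tree representation concrete, here is a minimal sketch of how a learned decision tree over symbolic features can be stored as nested dictionaries and used to classify a record such as the example above. The splits shown are hypothetical (the actual splits from the slide's figure were lost in extraction); the names `TREE` and `classify` are mine.

```python
# A decision tree over symbolic features, stored as nested dicts.
# Internal nodes: {"feature": name, "children": {value: subtree}}
# Leaves:         {"label": "good" or "bad"}
# NOTE: these splits are hypothetical, for illustration only.
TREE = {
    "feature": "cylinders",
    "children": {
        "4": {"feature": "horsepower",
              "children": {"low":    {"label": "good"},
                           "medium": {"label": "bad"},
                           "high":   {"label": "bad"}}},
        "5": {"label": "bad"},
        "6": {"label": "bad"},
        "8": {"label": "bad"},
    },
}

def classify(tree, record):
    """Walk from the root, following the branch matching the record's
    value for each split feature, until a leaf is reached."""
    while "label" not in tree:
        tree = tree["children"][record[tree["feature"]]]
    return tree["label"]

record = {"cylinders": "4", "displacement": "low", "horsepower": "low",
          "weight": "low", "acceleration": "high",
          "modelyear": "75to78", "maker": "asia"}
print(classify(TREE, record))  # -> good
```

Classification cost is one dictionary lookup per level, which is part of why prediction with decision trees is computationally cheap.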
Representational Power
- Q: Can trees represent arbitrary Boolean expressions?
- Q: How many Boolean functions are there over N binary attributes?

How to Generate Trees from Training Data

A Simple Idea
- Enumerate all possible trees
- Check how well each tree matches the training data
- Pick the one that works best
- Problems?
  - Too many trees
  - How do we determine the quality of a decision tree?

Solution: A Greedy Approach
- Choose the most informative feature
- Split the data set on it
- Recurse until each data item is classified correctly

How to Determine the Best Feature?
- Which feature is more informative with respect to MPG?
- What metric should be used? Mutual information!
(From Andrew Moore's slides)

Mutual Information for Selecting Best Features

  I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}

where Y is the class label, MPG (good or bad), and X is a feature such as cylinders (3, 4, 6, 8).
(From Andrew Moore's slides)

Another Example: Playing Tennis

Example: Playing Tennis
Starting from 14 examples (9+, 5-):
- Humidity splits them into High (3+, 4-) and Norm (6+, 1-):

  I(Humidity; Play) = \sum_{h \in \{high, norm\}} \sum_{p \in \{+, -\}} P(h, p) \log \frac{P(h, p)}{P(h) P(p)} \approx 0.151

- Wind splits them into Weak (6+, 2-) and Strong (3+, 3-):

  I(Wind; Play) = \sum_{w \in \{weak, strong\}} \sum_{p \in \{+, -\}} P(w, p) \log \frac{P(w, p)}{P(w) P(p)} \approx 0.048

So Humidity is the more informative split.

Prediction for Nodes
- What is the prediction for each node? Predict the most common output among the training records that reach the node.
(From Andrew Moore's slides)

Recursively Growing Trees
- Take the original dataset and partition it according to the value of the attribute we split on:
  cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8
- Then build a tree from each partition's records.
(From Andrew Moore's slides)

A Two-Level Tree
- Obtained by recursively growing trees.

When Should We Stop Growing Trees?
- Should we split this node?
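The mutual-information computation above fits in a few lines of Python and reproduces the tennis numbers (about 0.151 for Humidity versus 0.048 for Wind, so Humidity wins). Function and variable names below are mine; the counts are the ones from the slide.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x,y) of P(x,y) * log2( P(x,y) / (P(x)P(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))        # joint counts
    px, py = Counter(xs), Counter(ys) # marginal counts
    return sum(c / n * log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

# The 14 tennis examples: 9 positive, 5 negative.
play = ["+"] * 9 + ["-"] * 5
# Humidity: High covers (3+, 4-), Norm covers (6+, 1-).
humidity = (["high"] * 3 + ["norm"] * 6) + (["high"] * 4 + ["norm"] * 1)
# Wind: Weak covers (6+, 2-), Strong covers (3+, 3-).
wind = (["weak"] * 6 + ["strong"] * 3) + (["weak"] * 2 + ["strong"] * 3)

print(round(mutual_information(humidity, play), 3))  # ~0.152 (slide rounds to 0.151)
print(round(mutual_information(wind, play), 3))      # ~0.048
```

Note that I(X; Y) equals the information gain H(Y) - H(Y | X), so "pick the feature with the highest mutual information" and "pick the feature with the highest information gain" are the same rule.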
Base Cases
- Base Case One: If all records in the current data subset have the same output, then don't recurse.
- Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.

Base Cases: An Idea
- Proposed Base Case 3: If all attributes have zero information gain, then don't recurse.
- Is this a good idea?

Old Topic: Overfitting

What Should We Do?
- Pruning

Pruning Decision Trees
- Stop growing the tree in time, or
- Build the full decision tree as before, but when you can grow it no more, start to prune:
  - Reduced error pruning
  - Rule post-pruning

Reduced Error Pruning
- Split the data into a training set and a validation set
- Build a full decision tree over the training set
- Keep removing the node whose removal maximally increases validation-set accuracy

Original Decision Tree / Pruned Decision Tree
[Figures: the full tree before pruning and the smaller tree after reduced error pruning]

Rule Post-Pruning
- Convert the tree into rules
- Prune each rule by removing preconditions
- Sort the final rules by their estimated accuracy
- The most widely used method (e.g., in C4.5)
- Other methods: statistical significance tests (chi-square)

Real-Valued Inputs
- What should we do to deal with real-valued inputs?

  mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
  good  4          97            75          2265    18.2          77         asia
  bad   6          199           90          2648    15            70         america
  bad   4          121           110         2600    12.8          77         europe
  bad   8          350           175         4100    13            73         america
  bad   6          198           95          3102    16.5          74         america
  bad   4          108           94          2379    16.5          73         asia
  bad   4          113           95          2228    14            71         asia
  bad   8          302           139         3570    12.8          78         america
  :     :          :             :           :       :             :          :
  good  4          120           79          2625    18.6          82         america
  bad   8          455           225         4425    10            70         america
  good  4          107           86          2464    15.5          76         europe
  bad   5          131           103         2830    15.9          78         europe

Information Gain
- x: a real-valued input
- t: a split value (threshold)
- Find the split value t such that the mutual information I(x, y : t) between x and the class label y is maximized.

Conclusions
- Decision trees are the single most popular data mining tool
  - Easy to understand
  - Easy to implement
  - Easy to use
  - Computationally cheap
- It's possible to get into trouble with overfitting
- They do classification: predict a categorical output from categorical and/or real-valued inputs

Software
- Most widely used decision tree: C4.5 (or C5.0)
- Tutorial (with source code): http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
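The threshold search for real-valued inputs described above can be sketched as follows: try each midpoint between consecutive sorted values as a candidate threshold t, and keep the one whose binary test x < t yields the highest information gain. This is an illustrative sketch, not the C4.5 implementation; the function names and the toy weight/label data are mine.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum over y of P(y) log2 P(y)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (t, gain): the threshold t on real-valued input x that
    maximizes the information gain I(x, y : t) of the test x < t."""
    pairs = sorted(zip(xs, ys))
    base = entropy(ys)
    best_t, best_gain = None, -1.0
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        t = (a + b) / 2  # candidate: midpoint of consecutive values
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(ys)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical records: weight vs. mpg class, chosen so one
# threshold separates the classes perfectly.
weights = [2265, 2379, 2464, 2625, 2830, 3102, 4100, 4425]
mpg = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
print(best_split(weights, mpg))  # -> (2727.5, 1.0)
```

Only midpoints between consecutive sorted values need to be checked, because the gain cannot change between two adjacent data points; this keeps the search linear in the number of records after sorting.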
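Putting the pieces together, the greedy procedure from the slides (choose the most informative feature, split, recurse until a base case fires) can be sketched in plain Python. This is an illustrative ID3-style sketch, not C4.5; the function names and the four toy records are mine.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, feature):
    """Mutual information between a symbolic feature and the class label."""
    n = len(labels)
    gain = entropy(labels)
    for value in {r[feature] for r in records}:
        subset = [y for r, y in zip(records, labels) if r[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def grow_tree(records, labels, features):
    # Base case 1: all records have the same output -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left to split on -> majority-vote leaf.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the most informative feature.
    best = max(features, key=lambda f: info_gain(records, labels, f))
    branches = {}
    for value in {r[best] for r in records}:
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == value]
        branches[value] = grow_tree([r for r, _ in sub],
                                    [y for _, y in sub],
                                    [f for f in features if f != best])
    return (best, branches)

# Four toy records in the spirit of the MPG data.
records = [{"cylinders": "4", "weight": "low"},
           {"cylinders": "4", "weight": "low"},
           {"cylinders": "8", "weight": "high"},
           {"cylinders": "8", "weight": "low"}]
labels = ["good", "good", "bad", "bad"]
print(grow_tree(records, labels, ["cylinders", "weight"]))
# -> ('cylinders', {...}) -- cylinders has the higher gain, branch order may vary
```

Here `cylinders` separates the classes perfectly (gain 1.0 bit versus about 0.31 for `weight`), so the root splits on it and both subsets immediately hit Base Case 1.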