Decision Tree
Rong Jin

Determine Mileage Per Gallon

  mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
  good  4          low           low         low     high          75to78     asia
  bad   6          medium        medium      medium  medium        70to74     america
  bad   4          medium        medium      medium  low           75to78     europe
  bad   8          high          high        high    low           70to74     america
  bad   6          medium        medium      medium  medium        70to74     america
  bad   4          low           medium      low     medium        70to74     asia
  bad   4          low           medium      low     low           70to74     asia
  bad   8          high          high        high    low           75to78     america
  :     :          :             :           :       :             :          :
  bad   8          high          high        high    low           70to74     america
  good  8          high          medium      high    high          79to83     america
  bad   8          high          high        high    low           75to78     america
  good  4          low           low         low     low           79to83     america
  bad   6          medium        medium      medium  high          75to78     america
  good  4          medium        low         low     low           79to83     america
  good  4          low           low         medium  high          79to83     america
  bad   8          high          high        high    low           70to74     america
  good  4          low           medium      low     medium        75to78     europe
  bad   5          medium        medium      medium  medium        75to78     europe

A Decision Tree for Determining MPG
(From slides of Andrew Moore)
[Figure: the learned decision tree, which classifies the following record as "good"]
  Example record: cylinders = 4, displacement = low, horsepower = low, weight = low, acceleration = high, modelyear = 75to78, maker = asia  ->  mpg = good

Decision Tree Learning
- An extremely popular method
  - Credit risk assessment
  - Medical diagnosis
  - Market analysis
- Good at dealing with symbolic features
- Easy to comprehend, compared to logistic regression models and support vector machines
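To make the tree representation concrete, here is a minimal sketch of how a learned decision tree over symbolic features can be stored as nested dictionaries and used to classify a record such as the example above. The splits shown are hypothetical (the actual splits from the slide's figure were lost in extraction); the names `TREE` and `classify` are mine.

```python
# A decision tree over symbolic features, stored as nested dicts.
# Internal nodes: {"feature": name, "children": {value: subtree}}
# Leaves:         {"label": "good" or "bad"}
# NOTE: these splits are hypothetical, for illustration only.
TREE = {
    "feature": "cylinders",
    "children": {
        "4": {"feature": "horsepower",
              "children": {"low":    {"label": "good"},
                           "medium": {"label": "bad"},
                           "high":   {"label": "bad"}}},
        "5": {"label": "bad"},
        "6": {"label": "bad"},
        "8": {"label": "bad"},
    },
}

def classify(tree, record):
    """Walk from the root, following the branch matching the record's
    value for each split feature, until a leaf is reached."""
    while "label" not in tree:
        tree = tree["children"][record[tree["feature"]]]
    return tree["label"]

record = {"cylinders": "4", "displacement": "low", "horsepower": "low",
          "weight": "low", "acceleration": "high",
          "modelyear": "75to78", "maker": "asia"}
print(classify(TREE, record))  # -> good
```

Classification cost is one dictionary lookup per level, which is part of why prediction with decision trees is computationally cheap.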
Representational Power
- Q: Can trees represent arbitrary Boolean expressions?
- Q: How many Boolean functions are there over N binary attributes?

How to Generate Trees from Training Data

A Simple Idea
- Enumerate all possible trees
- Check how well each tree matches the training data
- Pick the one that works best
- Problems?
  - Too many trees
  - How do we determine the quality of a decision tree?

Solution: A Greedy Approach
- Choose the most informative feature
- Split the data set on it
- Recurse until each data item is classified correctly

How to Determine the Best Feature?
- Which feature is more informative with respect to MPG?
- What metric should be used? Mutual information!
(From Andrew Moore's slides)

Mutual Information for Selecting Best Features

  I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}

where Y is the class label, MPG (good or bad), and X is a feature such as cylinders (3, 4, 6, 8).
(From Andrew Moore's slides)

Another Example: Playing Tennis

Example: Playing Tennis
Starting from 14 examples (9+, 5-):
- Humidity splits them into High (3+, 4-) and Norm (6+, 1-):

  I(Humidity; Play) = \sum_{h \in \{high, norm\}} \sum_{p \in \{+, -\}} P(h, p) \log \frac{P(h, p)}{P(h) P(p)} \approx 0.151

- Wind splits them into Weak (6+, 2-) and Strong (3+, 3-):

  I(Wind; Play) = \sum_{w \in \{weak, strong\}} \sum_{p \in \{+, -\}} P(w, p) \log \frac{P(w, p)}{P(w) P(p)} \approx 0.048

So Humidity is the more informative split.

Prediction for Nodes
- What is the prediction for each node? Predict the most common output among the training records that reach the node.
(From Andrew Moore's slides)

Recursively Growing Trees
- Take the original dataset and partition it according to the value of the attribute we split on:
  cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8
- Then build a tree from each partition's records.
(From Andrew Moore's slides)

A Two-Level Tree
- Obtained by recursively growing trees.

When Should We Stop Growing Trees?
- Should we split this node?
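The mutual-information computation above fits in a few lines of Python and reproduces the tennis numbers (about 0.151 for Humidity versus 0.048 for Wind, so Humidity wins). Function and variable names below are mine; the counts are the ones from the slide.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x,y) of P(x,y) * log2( P(x,y) / (P(x)P(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))        # joint counts
    px, py = Counter(xs), Counter(ys) # marginal counts
    return sum(c / n * log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

# The 14 tennis examples: 9 positive, 5 negative.
play = ["+"] * 9 + ["-"] * 5
# Humidity: High covers (3+, 4-), Norm covers (6+, 1-).
humidity = (["high"] * 3 + ["norm"] * 6) + (["high"] * 4 + ["norm"] * 1)
# Wind: Weak covers (6+, 2-), Strong covers (3+, 3-).
wind = (["weak"] * 6 + ["strong"] * 3) + (["weak"] * 2 + ["strong"] * 3)

print(round(mutual_information(humidity, play), 3))  # ~0.152 (slide rounds to 0.151)
print(round(mutual_information(wind, play), 3))      # ~0.048
```

Note that I(X; Y) equals the information gain H(Y) - H(Y | X), so "pick the feature with the highest mutual information" and "pick the feature with the highest information gain" are the same rule.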
Base Cases
- Base Case One: If all records in the current data subset have the same output, then don't recurse.
- Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.

Base Cases: An Idea
- Proposed Base Case 3: If all attributes have zero information gain, then don't recurse.
- Is this a good idea?

Old Topic: Overfitting

What Should We Do?
- Pruning

Pruning Decision Trees
- Stop growing the tree in time, or
- Build the full decision tree as before, but when you can grow it no more, start to prune:
  - Reduced error pruning
  - Rule post-pruning

Reduced Error Pruning
- Split the data into a training set and a validation set
- Build a full decision tree over the training set
- Keep removing the node whose removal maximally increases validation-set accuracy

Original Decision Tree / Pruned Decision Tree
[Figures: the full tree before pruning and the smaller tree after reduced error pruning]

Rule Post-Pruning
- Convert the tree into rules
- Prune each rule by removing preconditions
- Sort the final rules by their estimated accuracy
- The most widely used method (e.g., in C4.5)
- Other methods: statistical significance tests (chi-square)

Real-Valued Inputs
- What should we do to deal with real-valued inputs?

  mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
  good  4          97            75          2265    18.2          77         asia
  bad   6          199           90          2648    15            70         america
  bad   4          121           110         2600    12.8          77         europe
  bad   8          350           175         4100    13            73         america
  bad   6          198           95          3102    16.5          74         america
  bad   4          108           94          2379    16.5          73         asia
  bad   4          113           95          2228    14            71         asia
  bad   8          302           139         3570    12.8          78         america
  :     :          :             :           :       :             :          :
  good  4          120           79          2625    18.6          82         america
  bad   8          455           225         4425    10            70         america
  good  4          107           86          2464    15.5          76         europe
  bad   5          131           103         2830    15.9          78         europe

Information Gain
- x: a real-valued input
- t: a split value (threshold)
- Find the split value t such that the mutual information I(x, y : t) between x and the class label y is maximized.

Conclusions
- Decision trees are the single most popular data mining tool
  - Easy to understand
  - Easy to implement
  - Easy to use
  - Computationally cheap
- It's possible to get into trouble with overfitting
- They do classification: predict a categorical output from categorical and/or real-valued inputs

Software
- Most widely used decision tree: C4.5 (or C5.0)
- Tutorial (with source code): http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
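The threshold search for real-valued inputs described above can be sketched as follows: try each midpoint between consecutive sorted values as a candidate threshold t, and keep the one whose binary test x < t yields the highest information gain. This is an illustrative sketch, not the C4.5 implementation; the function names and the toy weight/label data are mine.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum over y of P(y) log2 P(y)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (t, gain): the threshold t on real-valued input x that
    maximizes the information gain I(x, y : t) of the test x < t."""
    pairs = sorted(zip(xs, ys))
    base = entropy(ys)
    best_t, best_gain = None, -1.0
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        t = (a + b) / 2  # candidate: midpoint of consecutive values
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(ys)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical records: weight vs. mpg class, chosen so one
# threshold separates the classes perfectly.
weights = [2265, 2379, 2464, 2625, 2830, 3102, 4100, 4425]
mpg = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
print(best_split(weights, mpg))  # -> (2727.5, 1.0)
```

Only midpoints between consecutive sorted values need to be checked, because the gain cannot change between two adjacent data points; this keeps the search linear in the number of records after sorting.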
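Putting the pieces together, the greedy procedure from the slides (choose the most informative feature, split, recurse until a base case fires) can be sketched in plain Python. This is an illustrative ID3-style sketch, not C4.5; the function names and the four toy records are mine.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, feature):
    """Mutual information between a symbolic feature and the class label."""
    n = len(labels)
    gain = entropy(labels)
    for value in {r[feature] for r in records}:
        subset = [y for r, y in zip(records, labels) if r[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def grow_tree(records, labels, features):
    # Base case 1: all records have the same output -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left to split on -> majority-vote leaf.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the most informative feature.
    best = max(features, key=lambda f: info_gain(records, labels, f))
    branches = {}
    for value in {r[best] for r in records}:
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == value]
        branches[value] = grow_tree([r for r, _ in sub],
                                    [y for _, y in sub],
                                    [f for f in features if f != best])
    return (best, branches)

# Four toy records in the spirit of the MPG data.
records = [{"cylinders": "4", "weight": "low"},
           {"cylinders": "4", "weight": "low"},
           {"cylinders": "8", "weight": "high"},
           {"cylinders": "8", "weight": "low"}]
labels = ["good", "good", "bad", "bad"]
print(grow_tree(records, labels, ["cylinders", "weight"]))
# -> ('cylinders', {...}) -- cylinders has the higher gain, branch order may vary
```

Here `cylinders` separates the classes perfectly (gain 1.0 bit versus about 0.31 for `weight`), so the root splits on it and both subsets immediately hit Base Case 1.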