CS 6375 Machine Learning
Decision Trees
Instructor: Yang Liu

Supervised Classifier
A supervised classifier maps input attributes X1, X2, ..., XM to a class label, learned from examples with reference class labels. [diagram not reproduced]

Decision Tree Example I
Three variables:
- Attribute 1: Hair = {blond, dark}
- Attribute 2: Height = {tall, short}
- Class: Country = {G, P}
The class of a new input can be classified by following the tree all the way down to a leaf and reporting the output of the leaf. For example, (B, T) and (D, S) are each classified by the leaf they reach. [tree figure and its outputs not reproduced]

Decision Trees
- Decision trees are classifiers.
- Instances are represented as feature vectors.
- Nodes are equality and inequality tests on feature values, with one branch for each value of the feature.
- Leaves specify the categories (labels).
- A tree can categorize instances into multiple disjoint categories.

General Case: Discrete Attributes
- We have R observations of training data.
- Each observation has M attributes X1, ..., XM, where each Xi can take Ni distinct discrete values.
- Each observation has a class attribute Y with C distinct discrete values.
- Problem: construct a sequence of tests on the attributes such that, given a new input (x1, ..., xM), the class attribute y is correctly predicted.
- X: attributes of the training data (an R x M matrix); Y: classes of the training data (a length-R vector). Note: other notations for data representation are also used.

General Decision Tree: Discrete Attributes [figure not reproduced]

Decision Tree Example II
The class of a new input can be classified by following the tree all the way down to a leaf and reporting the output of the leaf. For example, (0.2, 0.8) and (0.8, 0.2) are each classified by the leaf they reach. [tree figure and its outputs not reproduced]

General Case: Continuous Attributes
- We have R observations of training data.
- Each observation has M attributes X1, ..., XM, where each Xi can take continuous values.
- Each observation has a class attribute Y with C distinct discrete values.
- Problem: construct a sequence of threshold tests of the form Xi <= ti such that, given a new input (x1, ..., xM), the class attribute y is correctly predicted.
- X: attributes of the training data (an R x M matrix); Y: classes of the training data (a length-R vector).

General Decision Tree: Continuous Attributes [figure not reproduced]

Side: Decision Tree Decision
Boundaries
Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one of the class labels.

Decision Boundaries
- Decision boundaries are also called decision surfaces in high-dimensional spaces.
- Taking binary classification as an example, a decision surface is a hypersurface that partitions the feature space into two sets, one for each class.
- If the decision surface is a hyperplane, the classification problem is linear and the classes are linearly separable: instances on one side of the hyperplane belong to one class, and those on the other side belong to the other class.

Basic Questions
- How do we choose the attribute (and value) to split on at each level of the tree?
- When do we stop splitting? When should a node be declared a leaf?
- If a leaf node is impure, how should its class label be assigned?

How to Choose the Attribute/Value to Split on at Each Level of the Tree
- Example: two classes (red circles, green crosses), two attributes X1 and X2, and 11 points of training data.
- Idea: construct a decision tree whose leaf nodes correctly predict the class of all the training examples.
- A node is pure when only one class remains: there is no ambiguity in the class label. A node may instead be almost pure (little ambiguity in the class label), or contain a mixture of classes that it does not disambiguate.
- We want the most compact (smallest) tree that classifies the training data correctly (Occam's razor), so we want the split choices that get us to pure nodes the fastest.

Entropy
- Entropy is a measure of the impurity of a distribution, defined as

      H = - sum_i P_i log2 P_i

  where P_i is the probability of occurrence of value i.
- High entropy: all the classes are nearly equally likely.
- Low entropy: a few classes are
likely, and most of the classes are rarely observed (using the convention 0 log2 0 = 0).

The entropy captures the degree of purity of the distribution. [Example entropy calculation: figure not reproduced]

Conditional Entropy
- Entropy before splitting: H.
- After splitting, a fraction P_L of the data goes to the left node, which has entropy H_L, and a fraction P_R goes to the right node, which has entropy H_R.
- The average entropy (conditional entropy) after splitting is

      H_after = P_L * H_L + P_R * H_R

Information Gain
- We want nodes as pure as possible, i.e., we want to reduce the entropy as much as possible.
- We therefore maximize the difference between the entropy of the parent node and the expected entropy of the children. This difference is the information gain (IG): the amount by which the ambiguity is decreased by splitting the node.

Notation
- Entropy H(Y): entropy of the distribution of classes at a node.
- Conditional entropy, discrete, H(Y | Xj): entropy after splitting on variable j.
- Conditional entropy, continuous, H(Y | Xj, t): entropy after splitting on variable j with threshold t.
- Information gain, discrete: IG(Y; Xj) = H(Y) - H(Y | Xj).
- Information gain, continuous: IG(Y; Xj, t) = H(Y) - H(Y | Xj, t).

Another Illustrative Example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
 1   Sunny     Hot          High      Weak    No
 2   Sunny     Hot          High      Strong  No
 3   Overcast  Hot          High      Weak    Yes
 4   Rain      Mild         High      Weak    Yes
 5   Rain      Cool         Normal    Weak    Yes
 6   Rain      Cool         Normal    Strong  No
 7   Overcast  Cool         Normal    Strong  Yes
 8   Sunny     Mild         High      Weak    No
 9   Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High     Strong  No

With 9 "Yes" and 5 "No" examples, the entropy of the class distribution is

    H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
[The slide repeats the PlayTennis table with this 9 "Yes" / 5 "No" split highlighted.]

Another Illustrative Example
[The Humidity and Wind columns of the table are repeated to set up the next split computation; the excerpt is truncated at this point.]
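The entropy and information-gain quantities above can be checked numerically on the PlayTennis data. The following is a small sketch in my own code (not from the lecture); the Wind counts — Weak: 6 Yes / 2 No, Strong: 3 Yes / 3 No — are read directly off the table.

```python
from math import log2

def entropy(counts):
    """H = -sum_i p_i log2 p_i, with the convention 0 log2 0 = 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# Entropy before splitting: 9 "Yes" and 5 "No" examples.
h_s = entropy([9, 5])
print(round(h_s, 2))          # 0.94, as computed on the slide

# Conditional entropy after splitting on Wind:
# Weak covers 8 examples (6 Yes, 2 No); Strong covers 6 (3 Yes, 3 No).
h_after = (8 / 14) * entropy([6, 2]) + (6 / 14) * entropy([3, 3])
ig_wind = h_s - h_after       # information gain IG(Y; Wind)
print(round(ig_wind, 3))      # 0.048
```

The gain from Wind is small because the Strong branch stays maximally impure (3 Yes vs. 3 No has entropy 1).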
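The overall procedure the lecture describes — repeatedly split on the attribute with the highest information gain, declare a leaf when a node is pure, and classify a new input by following the tests down to a leaf — can be sketched end to end. This is a minimal illustration in my own code (names such as `build` and `classify` are not from the lecture), run on the PlayTennis table.

```python
# Minimal ID3-style decision-tree sketch for discrete attributes
# (an illustration under my own naming, not the lecture's code).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """IG(Y; X_attr) = H(Y) - sum_v P(X=v) * H(Y | X=v)."""
    gain, n = entropy(labels), len(labels)
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def build(rows, labels, attrs):
    if len(set(labels)) == 1 or not attrs:            # pure node, or no tests left:
        return Counter(labels).most_common(1)[0][0]   # leaf with the majority label
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build([rows[i] for i in keep],
                                        [labels[i] for i in keep],
                                        [a for a in attrs if a != best])
    return node

def classify(node, x):
    while isinstance(node, dict):          # follow the tests down to a leaf
        node = node["branches"][x[node["attr"]]]
    return node

# The PlayTennis table, column by column.
cols = {
    "Outlook": "Sunny Sunny Overcast Rain Rain Rain Overcast "
               "Sunny Sunny Rain Sunny Overcast Overcast Rain".split(),
    "Temperature": "Hot Hot Hot Mild Cool Cool Cool "
                   "Mild Cool Mild Mild Mild Hot Mild".split(),
    "Humidity": "High High High High Normal Normal Normal "
                "High Normal Normal Normal High Normal High".split(),
    "Wind": "Weak Strong Weak Weak Weak Strong Strong "
            "Weak Weak Weak Strong Strong Weak Strong".split(),
}
labels = "No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No".split()
rows = [{a: col[i] for a, col in cols.items()} for i in range(14)]

tree = build(rows, labels, list(cols))
print(tree["attr"])   # root test: Outlook has the highest information gain
```

Note that `classify` assumes every attribute value it meets was seen during training; handling unseen values (and deciding when to stop splitting early) is exactly the kind of question the "Basic Questions" slide raises.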