**Decision Trees**

Nicholas Ruozzi, University of Texas at Dallas
(Based on the slides of Vibhav Gogate and David Sontag)

**Supervised Learning**

- Input: labeled training data, i.e., data plus the desired output
- Assumption: there exists a function f that maps data items to their correct labels
- Goal: construct an approximation to f

**Today**

- So far, we have been focusing on linear separators
  - Relatively easy to learn using standard techniques
  - Easy to picture, but it is not clear whether the data will be separable
- The next two lectures focus on other hypothesis spaces
  - Decision trees
  - Nearest neighbor classification

**Application: Medical Diagnosis**

- Suppose that you go to your doctor with flu-like symptoms. How does your doctor determine whether you have a flu that requires medical attention?
- Check a list of symptoms:
  - Do you have a fever over 100.4 degrees Fahrenheit?
  - Do you have a sore throat or a stuffy nose?
  - Do you have a dry cough?
- Just having some symptoms is not enough; you should also not have symptoms that are inconsistent with the flu. For example:
  - if you have a fever over 100.4 degrees Fahrenheit,
  - and you have a sore throat or a stuffy nose,
  - then you probably do not have the flu (most likely just a cold).
- In other words, your doctor performs a series of tests and asks a series of questions in order to determine the likelihood that you have a severe case of the flu
- This is a method of coming to a diagnosis, i.e., a classification of your condition
- We can view this decision-making process as a tree

**Decision Trees**

- A decision tree is a tree in which each internal (non-leaf) node tests the value of a particular feature, and each leaf node specifies a class label (in this example, whether or not you should play tennis)
- Features: Outlook, Humidity, Wind
- Classification is performed root to leaf: the feature vector (Sunny, Normal, Strong) would be classified as a "yes" instance
- Decision trees can have continuous features too; internal nodes for continuous features correspond to thresholds

**Decision Trees and the Feature Space**

- Decision trees divide the feature space into axis-parallel rectangles
  (Figures omitted: the slides step through a 2D example, repeatedly subdividing a region with axis-parallel splits; only axis-tick residue survives in this text preview.)
- Worst case: a decision tree may require a number of nodes that is exponential in the dimension of the data

**Decision Tree Learning**

- Basic decision tree building algorithm:
  - Pick some feature/attribute (how do we pick the "best" one?)
  - Partition the data based on the value of this attribute
  - Recurse over each new partition (when do we stop?)
- We will focus on the discrete case first, i.e., each feature takes a value in some finite set

**Decision Trees**

- What functions can be represented by decision trees?
  - Every function of the features can be represented by a sufficiently complicated decision tree
- Are decision trees unique?
  - No: many different decision trees are possible for the same set of labels

**Choosing the Best Attribute**

- Because the complexity of storage and classification increases with the size of the tree, we should prefer smaller trees
  - The simplest models that explain the data are usually preferred over more complicated ones
- Finding the smallest tree is an NP-hard problem
- Instead, use a greedy, heuristic approach that picks the best attribute at each stage
- Which attribute should you split on?
  (Figure omitted: the slides compare candidate splits by the counts of positive and negative labels in each resulting partition.)
- We can think of these counts as probability distributions over the labels, e.g., the empirical probability that the label is equal to 1 in each partition
- The selected attribute is a good split if we are more certain about the classification after the split
  - If each partition with respect to the chosen attribute has a distinct class label, we are completely certain about the classification after partitioning
  - If the class labels are evenly divided between the partitions, the split isn't very good: we are very uncertain about the label in each partition
- What about other situations? How do you measure the uncertainty of a random process?

**Discrete Probability**

- The sample space Ω specifies the set of possible outcomes
  - For example, Ω = {H, T} would be the set of possible outcomes of a coin flip
- Each element ω ∈ Ω is associated with a number p(ω) ∈ [0, 1], called a probability; the probabilities sum to 1
  - For example, a biased coin might have p(H) = 1/4 and p(T) = 3/4
- An event is a subset of the sample space
  - Let Ω = {1, 2, 3, 4, 5, 6} be the 6 possible outcomes of a die roll
  - E = {1, 5, 6} ⊆ Ω would be the event that the die roll comes up as a one, a five, or a six
- The probability of an event is just the sum of the probabilities of the outcomes that it contains
  - p({1, 5, 6}) = p(1) + p(5) + p(6)

**Independence**

- Two events A and B are independent if p(A ∩ B) = p(A) p(B)
- Let's suppose that we have a fair …
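The root-to-leaf classification described earlier can be sketched in a few lines. The tree below is the classic "play tennis" tree that the (Sunny, Normal, Strong) example appears to come from; its exact structure is an assumption of this sketch, not something stated in the text.

```python
# A minimal sketch of root-to-leaf decision-tree classification.
# NOTE: this tree structure is the standard "play tennis" example and is
# an assumption here, not taken verbatim from the lecture.

# Each internal node is a (feature, branches) pair; each leaf is a label.
TREE = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, example):
    """Walk from the root to a leaf, following the example's feature values."""
    while isinstance(node, tuple):          # internal node: test a feature
        feature, branches = node
        node = branches[example[feature]]   # descend into the matching branch
    return node                             # leaf: a class label

x = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}
print(classify(TREE, x))  # -> Yes
```

Note that classification touches only one path of the tree, so its cost is the tree's depth, not its size.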
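The claim that decision trees carve the feature space into axis-parallel rectangles can be made concrete with a tiny depth-2 tree on two continuous features: each threshold test cuts the plane with an axis-parallel line, so each leaf corresponds to a rectangle. The thresholds below are illustrative choices, not values from the slides.

```python
# Sketch: a depth-2 decision tree on two continuous features x1, x2.
# The thresholds (4, 3, 5) are illustrative assumptions.
def region(x1, x2):
    if x1 <= 4:
        return "A" if x2 <= 3 else "B"   # rectangles with x1 <= 4
    else:
        return "C" if x2 <= 5 else "D"   # rectangles with x1 > 4

print(region(2, 2))  # -> A  (the rectangle x1 <= 4, x2 <= 3)
print(region(6, 7))  # -> D  (the rectangle x1 > 4,  x2 > 5)
```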
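The greedy tree-building procedure (pick an attribute, partition the data on its values, recurse; stop on a pure partition) can be sketched as follows. For "how to pick the best attribute" it uses Shannon entropy as the uncertainty measure, the standard choice and the one this lecture is building toward; treating entropy as the scoring rule at this point in the text is an assumption of the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels: a standard uncertainty measure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(data, labels, attributes):
    """Greedy choice: the attribute whose split leaves the least weighted
    average label uncertainty (equivalently, the largest information gain)."""
    def remainder(a):
        total = 0.0
        for v in {x[a] for x in data}:
            idx = [i for i, x in enumerate(data) if x[a] == v]
            total += len(idx) / len(data) * entropy([labels[i] for i in idx])
        return total
    return min(attributes, key=remainder)

def build_tree(data, labels, attributes):
    """Recursively partition the data; discrete-valued features assumed."""
    if len(set(labels)) == 1:                    # pure partition: stop
        return labels[0]
    if not attributes:                           # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(data, labels, attributes)
    branches = {}
    for v in {x[a] for x in data}:               # partition on a's values
        idx = [i for i, x in enumerate(data) if x[a] == v]
        branches[v] = build_tree([data[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [b for b in attributes if b != a])
    return (a, branches)
```

This greedy heuristic is not guaranteed to find the smallest tree (that problem is NP-hard, as noted above); it only makes the locally best split at each stage.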
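The discrete-probability definitions above (events as subsets of the sample space, event probability as a sum over the outcomes it contains, and independence) can be checked with a tiny example over a fair six-sided die. The particular events A and B below are illustrative choices of mine, not taken from the slides.

```python
from fractions import Fraction

# A fair six-sided die: each outcome in the sample space has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(E) = sum of the probabilities of the outcomes E contains
    (uniform here, so it reduces to |E| / |omega|)."""
    return Fraction(len(event), len(omega))

E = {1, 5, 6}
print(prob(E))  # -> 1/2

# Independence check: A and B are independent iff P(A & B) == P(A) * P(B).
A = {2, 4, 6}          # "the roll is even"        (illustrative event)
B = {1, 2, 3, 4}       # "the roll is at most 4"   (illustrative event)
print(prob(A & B) == prob(A) * prob(B))  # -> True
```

Exact rational arithmetic (`Fraction`) avoids the floating-point round-off that could make an equality check like this unreliable.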
