Classification Trees 36 350 Data Mining 29 Octber 2008 Classification trees work just like regression trees only they try to predict a discrete category the class rather than a numerical value The variables which go into the classification the inputs can be numerical or categorical themselves the same way they can with a regression tree They are useful for the same reasons regression trees are they provide fairly comprehensible predictors in situations where there are many variables which interact in complicated nonlinear ways 1 An Example Mushrooms A field guide to mushrooms lists 5416 varieties with 22 categorical attributes for each as well as the class which is edible or poisonous The first row cap shape cap surface cap color bruises odor gill attachment gill spacing convex fibrous red bruises none free close gill size gill color stalk shape stalk root stalk surface above ring broad purple tapering bulbous smooth stalk surface below ring stalk color above ring stalk color below ring smooth gray pink veil type veil color ring number ring type spore print color population partial white one pendant brown several habitat class woods edible Despite all the data the classification tree is simple Figure 1 It manages to classify perfectly using an OR rule if a mushroom smells bad OR it has green spores OR it grows in clusters then it is poisonous 1 2 Growing Classification Trees We find classification trees in almost the same way we found regression trees we start with a single node and then look for the binary distinction which gives 1 Apparently perfect classifications like this should make one suspicious I have not run this by a mycologist to see if it makes sense though it is the kind of result which is worth running by a mycologist Anyway if you go mushroom hunting on this basis and get sick and die don t blame me 1 creosote foul musty pungent 1383 class poisonous 100 odor green 41 class poisonous 100 spore print color almond anise none clustered 9 class poisonous 100 black brown purple white population abundant numerous scattered several solitary 2370 class edible 100 Figure 1 Classification tree for North American wild mushrooms 2 us the most information about the class We then take each of the resulting new nodes and repeat the process there continuing the recursion until we reach some stopping criterion The resulting tree will often be too large i e over fit so we prune it back using say cross validation The differences from regressiontree growing have to do with 1 how we measure information 2 what kind of predictions the tree makes and 3 how we measure predictive error 2 1 Measuring Information The response variable Y is categorical so we can use information theory to measure how much we learn about it from knowing the value of another discrete variable A X I Y A Pr A a I Y A a 1 a where I Y A a H Y H Y A a 2 and you remember the definitions of entropy H Y and conditional entropy H Y A a I Y A a is how much our uncertainty about Y decreases from knowing that A a Less subjectively how much less variable Y becomes when we go from the full population to the sub population where A a I Y A is how much our uncertainty about Y shrinks on average from knowing the value of A For classification trees A isn t necessarily one of the predictors but rather the answer to some question generally binary about one of the predictors X i e A 1A X for some set A This doesn t change any of the math above however So we chose the question in the first root node of the tree so as to maximize I Y A which we calculate from the formula above using the relative frequencies in our data to get the probabilities When we want to get good questions at subsequent nodes we have to take into account what we know already at each stage Computationally we do this by computing the probabilities and informations using only the cases in that node rather than the complete data set Remember that we re doing recursive partitioning so at each stage the sub problem looks just like a smaller version of the original problem Mathematically what this means is that if we reach the node when A a and B b we look for the question C which maximizes I Y C A a B b the information conditional on A a B b Algebraically I Y C A a B b H Y A a B b H Y A a B b C 3 Computationally rather than looking at all the cases in our data set we just look at the ones where A a and B b and calculate as though that were all the data Also notice that the first term on the right hand side H Y A a B b does not depend on the next question C So rather than maximizing I Y C A a B b we can just minimize H Y A a B b C 3 2 2 Making Predictions There are two kinds of predictions which a classification tree can make One is a point prediction a single guess as to the class or category to say this is a flower or this is a tiger and nothing more The other idea is to give a probability for each of the classes This is slightly more general because if we need to extract a point prediction from a probability forecast we can always do so but we can t go in the other direction For probability forecasts each terminal node in the tree gives us a distribution over the classes If the terminal node corresponds to the sequence of answers A a B b Q q then ideally this would give us Pr Y y A a B b Q q for each possible value y of the response A simple way to get close to this is to use the empirical relative frequencies of the classes in that node E g if there are 33 cases at a certain leaf 22 of which are tigers and 11 of which are flowers the leaf should predict tiger with probability 2 3 flower with probability 1 3 This is the maximum likelihood estimate of the true probability distribution c and we ll write it Pr Incidentally while the empirical relative frequencies are consistent estimates of the true probabilities under many circumstances nothing particularly compells us to use them When the number of classes is large relative to the sample size we may easily fail to see any samples at all of a particular class The empirical relative frequency of that class is then zero This is good if the actual probability is zero not so good otherwise In fact under the negative loglikelihood error discussed below it s infinitely bad because we will eventually see that class but our model will say it s impossible The empirical relative frequency estimator is in a …
View Full Document
Unlocking...