Classification and Regression Trees 36 350 Data Mining 6 November 2009 Contents 1 Prediction Trees 1 2 Regression Trees 2 1 Example California Real Estate Again 2 2 Regression Tree Fitting 2 2 1 Cross Validation and Pruning in R 2 3 Uncertainty in Regression Trees 4 4 7 13 14 3 Classification Trees 3 1 Measuring Information 3 2 Making Predictions 3 3 Measuring Error 3 3 1 Misclassification Rate 3 3 2 Average Loss 3 3 3 Likelihood and Cross Entropy 3 3 4 Neyman Pearson Approach 18 19 20 20 20 21 21 23 4 Further Reading 24 5 Exercises 24 Reading Principles of Data Mining sections 10 5 and 5 2 in that order Berk chapter 3 Having built up increasingly complicated models for regression I ll now switch gears and introduce a class of nonlinear predictive model which at first seems too simple to possible work namely prediction trees These have two varieties regression trees and classification trees 1 Prediction Trees The basic idea is very simple We want to predict a response or class Y from inputs X1 X2 Xp We do this by growing a binary tree At each internal 1 node in the tree we apply a test to one of the inputs say Xi Depending on the outcome of the test we go to either the left or the right sub branch of the tree Eventually we come to a leaf node where we make a prediction This prediction aggregates or averages all the training data points which reach that leaf Figure 1 should help clarify this Why do this Predictors like linear or polynomial regression are global models where a single predictive formula is supposed to hold over the entire data space When the data has lots of features which interact in complicated nonlinear ways assembling a single global model can be very difficult and hopelessly confusing when you do succeed Some of the non parametric smoothers try to fit models locally and then paste them together but again they can be hard to interpret Additive models are at least pretty easy to grasp An alternative approach to nonlinear regression is to sub divide or partition the space into smaller regions where the interactions are more manageable We then partition the sub divisions again this is recursive partitioning as in hierarchical clustering until finally we get to chunks of the space which are so tame that we can fit simple models to them The global model thus has two parts one is just the recursive partition the other is a simple model for each cell of the partition Now look back at Figure 1 and the description which came before it Prediction trees use the tree to represent the recursive partition Each of the terminal nodes or leaves of the tree represents a cell of the partition and has attached to it a simple model which applies in that cell only A point x belongs to a leaf if x falls in the corresponding cell of the partition To figure out which cell we are in we start at the root node of the tree and ask a sequence of questions about the features The interior nodes are labeled with questions and the edges or branches between them labeled by the answers Which question we ask next depends on the answers to previous questions In the classic version each question refers to only a single attribute and has a yes or no answer e g Is HSGrad 0 78 or Is Region Midwest The variables can be of any combination of types continuous discrete but ordered categorical etc You could do more than binary questions but that can always be accommodated as a larger binary tree Asking questions about multiple variables at once is again equivalent to asking multiple questions about single variables That s the recursive partition part what about the simple local models For classic regression trees the model in each cell is just a constant estimate of Y That is suppose the points xi yi x2 y2 xc yc are all P the samples c belonging to the leaf node l Then our model for l is just y 1c i 1 yi the sample mean of the response variable in that cell This is a piecewise constant model 1 There are several advantages to this Making predictions is fast no complicated calculations just looking up 1 We could instead fit say a different linear regression for the response in each leaf node using only the data points in that leaf and using dummy variables for non quantitative features This would give a piecewise linear model rather than a piecewise constant one If we ve built the tree well however all the points in each leaf are pretty similar so the regression surface would be nearly constant anyway 2 Figure 1 Classification tree for county level outcomes in the 2008 Democratic Party primary as of April 16 by Amanada Cox for the New York Times 3 constants in the tree It s easy to understand what variables are important in making the prediction look at the tree If some data is missing we might not be able to go all the way down the tree to a leaf but we can still make a prediction by averaging all the leaves in the sub tree we do reach The model gives a jagged response so it can work when the true regression surface is not smooth If it is smooth though the piecewise constant surface can approximate it arbitrarily closely with enough leaves There are fast reliable algorithms to learn these trees A last analogy before we go into some of the mechanics One of the most comprehensible non parametric methods is k nearest neighbors find the points which are most similar to you and do what on average they do There are two big drawbacks to it first you re defining similar entirely in terms of the inputs not the response second k is constant everywhere when some points just might have more very similar neighbors than others Trees get around both problems leaves correspond to regions of the input space a neighborhood but one where the responses are similar as well as the inputs being nearby and their size can vary arbitrarily Prediction trees are adaptive nearest neighbor methods 2 Regression Trees Let s start with an example 2 1 Example California Real Estate Again After the homework and the last few lectures you should be more than familiar with the California housing data we ll try growing a regression tree for it There are several R packages for regression trees the easiest one is called simply tree calif read table teaching 350 hw 06 cadata dat header TRUE require tree treefit tree log MedianHouseValue Longitude Latitude data calif This does a tree regression of the log price on longitude and latitude What does this look like Figure 2 shows the tree itself Figure 3 shows the partition overlaid on the actual prices
View Full Document
Unlocking...