COS 424: Interacting with Data
Lecturer: Robert Schapire
Lecture #5
Scribe: Megan Lee
February 20, 2007

1 Decision Trees

During the last class, we talked about growing decision trees from a dataset. In doing this, there are two conflicting goals:

• achieving a low training error
• building a tree that is not too large

We discussed a greedy, heuristic algorithm that tries to achieve both of these conflicting goals.

2 Classification Error

Suppose that we cut off the growing process at various points and evaluate the error of the tree at each point. This leads to a graph of tree size vs. error, where error is the probability of making a mistake. There are two error rates to consider:

• training error (the fraction of mistakes made on the training set)
• test error (the fraction of mistakes made on the test set)

[Figure: two curves, tree size vs. training error and tree size vs. test error]

As the tree size increases, training error decreases. Test error also decreases at first, since we expect the test data to be similar to the training data, but at a certain point the training algorithm starts training to the noise in the data, becoming less accurate on the test data. At this point we are no longer fitting the data but rather the noise in it, so the test error curve starts to increase once the tree is too big and too complex to perform well on test data (an application of Occam's razor). This is called overfitting: the tree is fitted to spurious patterns in the data. A tree grown to full size fits the training data perfectly yet is of little practical use on other data such as the test set. (The sketches at the end of this section reproduce these curves.)

We want to choose a tree at the minimum of the test curve, but we cannot see the test curve during training. We build the tree using only the training error curve, which decreases with tree size. Again, we have two conflicting goals: there is a tradeoff between training error and tree size.

3 Tree Size

There are two general methods of controlling the size of the tree:

• grow the tree more carefully and try to stop the growing process at an appropriate point early on
• grow the biggest tree possible (one that completely fits the data), then prune it to be smaller (this is the more common method)

One common technique is to split the training set into two parts, the growing set and the pruning set. The tree is grown using only the growing set; the pruning set is then used to estimate the test error of every subtree of the grown tree, and the subtree with the lowest error on the pruning set is chosen as the decision tree. Here the pruning set serves as a proxy for the test set, in the hope that its error curve resembles the true test curve. For example, 2/3 of the training set might be used for growing and 1/3 for pruning. A disadvantage of this method is that training data is wasted, a serious problem if the dataset is small.

Another approach is to explicitly optimize a tradeoff between the number of errors and the size of the tree. Consider the value

    (# training errors) + constant × (size of tree)

Now there is only one value to minimize in determining the optimal tree. This value attempts to capture the two conflicting interests simultaneously.
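This grow-then-prune recipe can be made concrete. Below is a minimal sketch, assuming scikit-learn and a synthetic dataset (neither appears in the notes): it grows a full tree on 2/3 of the training data, enumerates candidate subtrees using scikit-learn's cost-complexity pruning path (whose penalty, total leaf impurity plus a constant times the number of leaves, is a close cousin of the single value above, not the exact objective from lecture), and keeps the subtree with the lowest error on the held-out pruning set.

```python
# Sketch of grow-then-prune with a growing/pruning split (assumes scikit-learn).
# cost_complexity_pruning_path enumerates the penalty levels at which subtrees
# of the fully grown tree become optimal for: impurity + alpha * (tree size).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)  # flip_y injects label noise
# 2/3 of the training data grows the tree; 1/3 is held out for pruning.
X_grow, X_prune, y_grow, y_prune = train_test_split(X, y, test_size=1/3,
                                                    random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_grow, y_grow)
alphas = full_tree.cost_complexity_pruning_path(X_grow, y_grow).ccp_alphas

# Refit at each penalty level and keep the subtree with the lowest error
# on the pruning set (our proxy for the test set).
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_grow, y_grow)
     for a in alphas),
    key=lambda t: t.score(X_prune, y_prune),
)
print("chosen tree has", best.get_n_leaves(), "leaves")
```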
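The size-vs-error curves from Section 2 can be reproduced in the same setting. Another minimal sketch, again assuming scikit-learn and synthetic data: as the cap on tree size grows, training error falls steadily, while test error typically falls and then rises once the tree begins fitting the injected label noise.

```python
# Sketch of the size-vs-error curves from Section 2 (assumes scikit-learn).
# Training error falls monotonically with tree size; test error falls, then
# rises once the tree starts fitting noise (overfitting).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_leaves in [2, 4, 8, 16, 32, 64, 128, 256]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)  # fraction of training mistakes
    test_err = 1 - tree.score(X_test, y_test)     # fraction of test mistakes
    print(f"{n_leaves:4d} leaves: train error {train_err:.3f}, "
          f"test error {test_err:.3f}")
```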
4 Assumptions in creating decision trees

As with any algorithm, various assumptions are made when building decision trees. Three of these assumptions are:

• The data can be described by features, such as the features of Batman characters. Sometimes we assume these features are discrete, but we can also use decision trees when the features are continuous. Binary decisions are made on the basis of a continuous feature by choosing a threshold that divides its range of values into intervals correlated with decisions (a code sketch of this thresholding appears at the end of these notes).
• The class label can be predicted using a logical set of decisions that can be summarized by the decision tree.
• The greedy procedure will be effective on the data that we are given, where effectiveness means finding a small tree with low error.

5 Decision tree history

Decision trees have been widely used since the 1980s. CART was an algorithm widely used in the statistical community, while ID3 and its successor, C4.5, were dominant in the machine learning community. These algorithms are fast, fairly easy to program, and interpretable (i.e., understandable). A drawback of decision trees is that they are generally not as accurate as other machine learning methods. We will look at some of these state-of-the-art algorithms later in the course.

It is difficult to explain exactly why decision trees are not optimally accurate. They may fail if the data is probabilistic or noisy; features in the tree are not weighted; and simplicity is hard to control, with overfitting a constant problem while growing the tree.

6 Theory: a mathematical model for the learning problem

Until now, we have taken a very intuitive approach. Although we know that we need data, low error, and a simple rule, many questions remain unresolved. For example, what does it mean for a rule to be simple? Why is simplicity so important? How much data is enough? What can we guarantee about accuracy? And how can we explain overfitting? We want to formalize the learning problem and define a measure of complexity.

6.1 Data

Training and test examples should be similar. For example, if we were classifying images of handwritten digits by the digits they represent, we would want training examples of all the digits 0-9, not, say, only 0-5. In the latter case, the algorithm would fail miserably. Generally, the test and training examples will be similar if they are produced by the same process.

The following formalizes this idea of the test and training examples being generated by the same process. Assume that the data is random, and that the test and training examples are generated by the same source, a distribution D. From this distribution D we draw an example x. The distribution is unknown, but all examples are i.i.d. since they originate from the same distribution. There is also a target function, c(x), that indicates the true label of each example. During training, the learning algorithm is given a set of examples x1, . . . , xn, each drawn from the distribution D, and each labeled by the target function c.
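This data model is easy to make concrete in code. In the minimal sketch below, the distribution D and the target function c are stand-ins invented purely for illustration; in the theory, both are unknown to the learner.

```python
# Sketch of the formal data model from Section 6: examples x1, ..., xn are
# drawn i.i.d. from an unknown distribution D, and a target function c(x)
# supplies the true label of each example. D and c here are invented
# placeholders; the learner never observes either one directly.
import random

def draw_from_D():
    """Stand-in for the unknown distribution D over examples."""
    return (random.gauss(0, 1), random.gauss(0, 1))

def c(x):
    """Stand-in target function: the true label of example x."""
    return 1 if x[0] + x[1] > 0 else 0

n = 5
training_set = [(x, c(x)) for x in (draw_from_D() for _ in range(n))]
print(training_set)  # the labeled sample handed to the learning algorithm
```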
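Finally, returning to the thresholding of continuous features assumed in Section 4 (and referenced there): one simple way to pick a binary decision "is x <= t?" is to scan candidate thresholds and keep the one making the fewest mistakes on the training examples. The helper best_threshold and its data below are invented for illustration; real tree growers typically score thresholds with a splitting criterion rather than raw error.

```python
# Sketch of the thresholding idea from Section 4: turn a continuous feature
# into a binary decision "x <= t?" by scanning candidate thresholds t and
# keeping the one that misclassifies the fewest training examples.
def best_threshold(values, labels):
    candidates = sorted(set(values))
    best_t, best_errors = None, len(values) + 1
    for t in candidates:
        for left_label in (0, 1):  # which class the "x <= t" side predicts
            errors = sum(
                1 for x, y in zip(values, labels)
                if (left_label if x <= t else 1 - left_label) != y
            )
            if errors < best_errors:
                best_t, best_errors = t, errors
    return best_t, best_errors

# Made-up one-dimensional data: small values are mostly class 0.
values = [0.3, 0.8, 1.1, 1.9, 2.5, 3.0, 3.4, 4.2]
labels = [0,   0,   0,   1,   0,   1,   1,   1  ]
print(best_threshold(values, labels))  # (1.1, 1): split at x <= 1.1, 1 mistake
```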