# ILLINOIS CS 446 - 092117.2 (9 pages)


- Pages: 9
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


## Lecture 8: Decision Trees

CS 446 Machine Learning, Fall 2017
Lecturer: Sanmi Koyejo
Scribe: Chaoyue Cui
Aug 21st, 2017

### Outline

- Recap
- CART
- Boosting (AdaBoost)

### Recap: MAP

The MAP estimate is the maximum a posteriori probability estimate. The posterior distribution of $\theta$ can be derived from Bayes' theorem:

$$P(\theta \mid D_n) = \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}$$

We then estimate $\theta$ as the mode of the posterior distribution. In some situations the calculation is easier after taking the log of the equation above, which does not change the maximizer. Moreover, the denominator is always positive and does not depend on $\theta$ (Wiki, 2017b), so we can omit it in the derivation:

$$\hat{\theta} = \arg\max_{\theta} P(\theta \mid D_n) = \arg\max_{\theta} \left[\log P(D_n \mid \theta) + \log P(\theta)\right]$$

### Variable Selection

Variable selection, also known as feature selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction (Wiki, 2017a). There are different ways to handle the $\log P(\theta)$ term in the MAP objective:

1. $\log P(\theta) \propto -\|\theta\|_0$, where $\|\cdot\|_0$ is the pseudo-norm. Two algorithms apply here: forward selection and backward selection.
2. $\log P(\theta) \propto -\|\theta\|_1$, where $\|\cdot\|_1$ is the L1 norm.

### Bias and Variance

Error can be written as a function of bias and variance:

$$\text{Error} = g(\text{bias}, \text{variance})$$

Usually these two quantities trade off: the bias goes up while the variance goes down, and vice versa. The absolute values of the bias and variance are usually not that important; what matters is how much they change relative to each other.

**Example.** Assume the error function is the expected square loss:

$$E\left[(y - f(x))^2\right] = \text{bias}^2 + \text{variance}$$

For one algorithm the bias is 4 and the variance is 3; for another, the bias is 3 and the variance is 5. In the first case $E = 4^2 + 3 = 19$, while in the second case $E = 3^2 + 5 = 14$. Even though the variance is smaller in the first case, the error is larger.

### Decision Trees (CART)

*Figure 1: Example 1*

Suppose that in a 2D coordinate system the space is divided into boxes by orthogonal lines parallel to the $x_1$ and $x_2$ axes. There may be several data points in each box, and the label of each data point is
positive or negative (see Figure 1). We want to draw a boundary that separates the regions with positive labels from the regions with negative labels. We can identify each small box as positive or negative and then split. However, a few issues need to be addressed:

- We cannot label the boundary examples.
  - Solution 1: decide on $\le$ or $<$.
  - Solution 2: adapt the boxes to the training data.
- Boxes may contain a mixture of labels. Solution: use majority voting to decide the label.
- A box may contain no data at all, in which case it is not clear how to make a decision.

To avoid these problems we can use a decision tree. Decision trees are also called classification and regression trees, abbreviated CART. A CART model is defined by recursively partitioning the input space and defining a local model in each resulting region of the input space (Murphy, 2012).

Consider another splitting problem, shown in Figure 2. First we find a line $x_1 = 3$, parallel to the $x_2$ axis, that splits the two kinds of labels. In the figure, the region to the left of this line contains a single label, but the region to the right contains both labels. Hence we continue and draw another line $x_2 = 1$, parallel to the $x_1$ axis, which separates the two labels. The decisions can be expressed more clearly in Figure 3.

*Figure 2: Example 2*

*Figure 3: Decision Tree*

Properties of decision trees:

- Training cost depends on the number of tests (splits).
- The memory needed to store the model is proportional to the size of the decision tree. (Question: why should we care about memory? Answer: in the worst case the model is as large as the training data; in general it is much smaller.)
- Prediction speed depends on the depth of the tree.
- The bias of a decision tree is higher than that of the nearest neighbor method, while its variance is lower.

There are two inductive biases in a decision tree. Firstly, the data is axis-separable. Secondly, the label (class) is partitioned into regions.
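The tree in Figure 3 can be sketched as nested threshold tests. The thresholds $x_1 \le 3$ and $x_2 \le 1$ come from the Figure 2 example, but the leaf labels below are assumptions, since the +/- symbols did not survive in this preview:

```python
def predict(x1, x2):
    """Figure 3's decision tree as nested if/else.

    Thresholds follow the Figure 2 example; the +1/-1 leaf labels
    are assumed, as the actual signs are missing from the preview.
    """
    if x1 <= 3:       # first split: line x1 = 3, parallel to the x2 axis
        return +1     # left region contains a single label
    elif x2 <= 1:     # second split: line x2 = 1, parallel to the x1 axis
        return -1
    else:
        return +1
```

Prediction walks one root-to-leaf path, so its cost is bounded by the tree depth (at most two comparisons here), matching the property that prediction speed depends on depth.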
There are several advantages of decision trees:

- They are scalable.
- They are robust to outliers.
- It is easy to mix discrete and continuous features.

### Growing a Decision Tree

To grow a decision tree we need to address two issues. Firstly, we must determine the dimension of the space on which to split. Secondly, for each dimension, we must decide the threshold at which to split the region, as in the previous example. For each dimension, the number of potential thresholds is equal to the number of samples.

The split function chooses the best feature, and the best value for that feature, as follows (Murphy, 2012):

$$(j^*, t^*) = \arg\min_{j \in \{1, \ldots, D\}} \min_{t \in \mathcal{T}_j} \left[ \text{cost}\left(\{(x_i, y_i) : x_{ij} \le t\}\right) + \text{cost}\left(\{(x_i, y_i) : x_{ij} > t\}\right) \right]$$

where $\mathcal{T}_j$ is the set of all thresholds for dimension $j$, and $\text{cost}(\cdot)$ is some abstract function used to evaluate the quality of a proposed split; its form depends on whether the setting is regression or classification (Murphy, 2012).

In a classification setting there are many ways to measure the quality of a split. First we fit a multinoulli model to the data in the leaf satisfying the test $X_j \le t$ by estimating the class-conditional probabilities (Murphy, 2012):

$$\hat{\pi}_c = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i = c)$$

where $n$ is the number of data points in the leaf. Given this, there are several error measures for evaluating a proposed partition:

- **Misclassification rate** (error rate):
  $$\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left(y_i \ne f(x_i)\right)$$
- **Entropy**:
  $$H(\hat{\pi}) = -\sum_{c=1}^{C} \hat{\pi}_c \log \hat{\pi}_c$$
- **Information gain**:
  $$\text{InfoGain} = H(\text{parent}) - H(\text{split}) = H(Y) - H(Y \mid X_j \le t)$$
- **Gini index**, where $\hat{\pi}_c$ is the probability that a random entry in the leaf belongs to class $c$:
  $$\sum_{c=1}^{C} \hat{\pi}_c (1 - \hat{\pi}_c)$$

In the two-class case, writing $p = \hat{\pi}_1$, the misclassification rate is $1 - \max(p, 1-p)$, the entropy is $H_2(p)$, and the Gini index is $2p(1-p)$. These are plotted in Figure 4. The cross-entropy and Gini measures are very similar, and are more sensitive to changes in the class probabilities than the misclassification rate (Murphy, 2012).

*Figure 4: Node impurity measures for binary classification (Murphy, 2012)*

### Boosting

Boosting learns a strong classifier by stagewise combining weak classifiers.
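The stagewise combination can be sketched as a weighted vote, $F(x) = \mathrm{sign}\left(\sum_m \alpha_m f_m(x)\right)$. A minimal sketch follows; the stump weak learners and the weights `alphas` are hypothetical placeholders, not values from the lecture (AdaBoost's actual weighting rule comes later):

```python
def combine(weak_clfs, alphas, x):
    """Strong classifier as a weighted vote: sign(sum_m alpha_m * f_m(x)).

    weak_clfs: functions mapping x to +1 or -1 (the weak classifiers)
    alphas:    per-stage weights; in AdaBoost these are computed from
               each stage's error rate, here they are simply given
    """
    s = sum(a * f(x) for a, f in zip(alphas, weak_clfs))
    return 1 if s >= 0 else -1

# hypothetical weak classifiers: threshold "stumps" on a scalar input
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 2 else -1,
          lambda x: 1 if x > -1 else -1]
alphas = [0.4, 0.3, 0.3]
```

Each stump alone is only a weak rule of thumb; the weighted vote can beat any single stump, which is the intuition behind boosting.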
The two key terms in the definition are *stagewise combining* and *weak classifier*.

#### Weak Classifiers

Suppose the error rate of a classifier is

$$\epsilon = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left(y_i \ne f(x_i)\right)$$

Consider a random classifier defined as

$$f(x) = \begin{cases} +1 & \text{with probability } 0.5 \\ -1 & \text{with probability } 0.5 \end{cases}$$

The error rate of a random classifier is 0.5. The derivation is as follows: …
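The derivation itself is cut off in this preview, but the claim is easy to sanity-check: since the prediction is independent of the true label, half the predictions disagree with it in expectation, whatever the label distribution. A minimal simulation (illustrative only, not the lecture's derivation):

```python
import random

def random_classifier_error(labels, trials=1, seed=0):
    """Empirical error rate of f(x) = +1/-1, each with probability 0.5.

    The prediction ignores the true label, so the expected error
    is 0.5 regardless of what `labels` contains.
    """
    rng = random.Random(seed)
    errors, total = 0, 0
    for _ in range(trials):
        for y in labels:
            pred = rng.choice([1, -1])
            errors += (pred != y)
            total += 1
    return errors / total

labels = [1] * 70 + [-1] * 30        # a deliberately skewed label set
err = random_classifier_error(labels, trials=2000)
```

Even with a 70/30 label split, the empirical error concentrates near 0.5, matching the claim above.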
