# ILLINOIS CS 446 - 092117.2 (9 pages)


- Pages: 9
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017. Lecture 8: Decision Trees. Lecturer: Sanmi Koyejo. Scribe: Chaoyue Cui. Aug 21, 2017.

### Outline

- Recap
- CART
- Boosting (AdaBoost)

### Recap: MAP

The MAP estimate is the maximum a posteriori probability estimate. The posterior distribution of $\theta$ can be derived from Bayes' theorem:

$$P(\theta \mid D_n) = \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}$$

We then estimate $\theta$ as the mode of the posterior distribution. In some situations it is easier to work with the logarithm of this expression, and doing so does not change the result. Moreover, the denominator is always positive and does not depend on $\theta$ (Wiki, 2017b), so we can omit it in the derivation:

$$\hat{\theta} = \arg\max_\theta P(\theta \mid D_n) = \arg\max_\theta \left[ \log P(D_n \mid \theta) + \log P(\theta) \right]$$

### Variable Selection

Variable selection, also known as feature selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction (Wiki, 2017a). There are different ways to model the $\log P(\theta)$ term in the MAP objective:

1. $\log P(\theta) \propto -\|\theta\|_0$, where $\|\cdot\|_0$ is the pseudo-norm. Two algorithms under this penalty are forward selection and backward selection.
2. $\log P(\theta) \propto -\|\theta\|_1$, where $\|\cdot\|_1$ is the L1 norm.

### Bias and Variance

Error can be written as a function of bias and variance:

$$\text{Error} = g(\text{bias}, \text{variance})$$

Usually there is a trade-off between the two terms: the bias goes up while the variance goes down, and vice versa. The individual values of the bias and variance are usually not that important; what matters is how much they change relative to each other.

**Example.** Assume the error function is the expected square loss:

$$E\left[(y - f(x))^2\right] = \text{bias}^2 + \text{variance}$$

For one algorithm the bias is 4 and the variance is 3; for another, the bias is 3 and the variance is 5. In the first case $E = 4^2 + 3 = 19$, while in the second case $E = 3^2 + 5 = 14$. Even though the variance is smaller in the first case, its error is larger.

### Decision Trees (CART)

Figure 1: Example 1.

Suppose a 2D space is divided into boxes by orthogonal lines parallel to the $x_1$ and $x_2$ axes. There may be several data points in each box, and the label of each data point is
positive or negative (see Figure 1). We want to draw a boundary that separates the regions with positive labels from the regions with negative labels. We can identify each small box as positive or negative and then split. However, a few issues need to be addressed:

- We cannot label the boundary examples. Solution 1: decide on $\leq$ or $<$. Solution 2: adapt the boxes to the training data.
- Boxes may contain a mixture of labels. Solution: use majority voting to decide the label.
- A box may contain no data at all, in which case it is not clear how to make a decision.

To avoid these problems we can use a decision tree. Decision trees are also called classification and regression trees, abbreviated CART.
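The box-partition idea above can be sketched in a few lines of code: partition the plane with axis-aligned thresholds, label each box by majority vote, and fall back to the global majority for empty boxes. This is a minimal illustration, not the lecture's implementation; the toy data, thresholds, and helper names are all made up.

```python
import numpy as np

# Hypothetical toy data (not from the lecture): 2D points with +/-1 labels.
X = np.array([[0.2, 0.3], [0.4, 0.9], [0.9, 0.8], [0.1, 0.7], [0.7, 0.6]])
y = np.array([1, 1, -1, 1, -1])

# Axis-aligned split points parallel to the x1 and x2 axes (assumed values).
t1, t2 = 0.5, 0.5

def box_id(x):
    # Using "<=" resolves which side boundary examples belong to (Solution 1).
    return (int(x[0] <= t1), int(x[1] <= t2))

# Majority vote inside each box (handles mixed boxes); empty boxes fall
# back to the global majority label.
global_majority = 1 if y.sum() >= 0 else -1
votes = {}
for xi, yi in zip(X, y):
    votes.setdefault(box_id(xi), []).append(yi)
box_label = {b: (1 if sum(v) >= 0 else -1) for b, v in votes.items()}

def predict(x):
    return box_label.get(box_id(x), global_majority)

print(predict([0.3, 0.4]))  # query point in the lower-left box
```

A real CART learner would choose the thresholds by optimizing a split criterion (e.g. Gini impurity) rather than fixing them by hand; the fixed grid here only mirrors the boxes in Figure 1.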

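Returning to the MAP recap, the identity $\hat{\theta} = \arg\max_\theta [\log P(D_n \mid \theta) + \log P(\theta)]$ can be checked numerically. The sketch below assumes a Bernoulli likelihood with a Beta(2, 2) prior, a setup chosen for illustration and not taken from the lecture, and compares a grid-search argmax of the log-posterior against the known closed-form mode.

```python
import numpy as np

# Assumed setup (not from the lecture): coin flips with a Beta(a, b) prior.
data = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # n = 8 flips, 6 heads
a, b = 2.0, 2.0                            # Beta prior hyperparameters

# Grid over theta, avoiding the endpoints where log(0) would appear.
thetas = np.linspace(1e-3, 1 - 1e-3, 9999)
heads, n = data.sum(), len(data)

log_lik = heads * np.log(thetas) + (n - heads) * np.log(1 - thetas)
log_prior = (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas)

# MAP estimate: mode of the posterior = argmax of log-likelihood + log-prior.
theta_map = thetas[np.argmax(log_lik + log_prior)]

# Closed-form posterior mode for Beta-Bernoulli: (heads + a - 1) / (n + a + b - 2).
closed_form = (heads + a - 1) / (n + a + b - 2)
print(theta_map, closed_form)
```

Note that the normalizing constant $P(D_n)$ never appears in the code, exactly as in the derivation above: it shifts the log-posterior by a constant and cannot change the argmax.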