# ILLINOIS CS 446 - 092117.2 (9 pages)


- Pages: 9
- School: University of Illinois, Urbana
- Course: CS 446 - Machine Learning


CS446 Machine Learning, Fall 2017

**Lecture 8: Decision Trees**

Lecturer: Sanmi Koyejo. Scribe: Chaoyue Cui. Aug 21st, 2017.

## Outline

- Recap
- CART
- Boosting (AdaBoost)

## Recap

### MAP

The MAP estimate is the maximum a posteriori probability estimate. The posterior distribution of $\theta$ can be derived from Bayes' theorem:

$$P(\theta \mid D_n) = \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}$$

Then we can estimate $\theta$ as the mode of the posterior distribution. In some situations it is easier to maximize after taking the log of the above equation, which does not change the result. Moreover, the denominator is always positive and does not depend on $\theta$ (Wiki, 2017b), so we can omit it in the derivation:

$$\hat{\theta} = \arg\max_\theta P(\theta \mid D_n) = \arg\max_\theta \left[\log P(D_n \mid \theta) + \log P(\theta)\right]$$

### Variable Selection

Variable selection, also known as feature selection, attribute selection, or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction (Wiki, 2017a).

There are different choices for the prior term $\log P(\theta)$ in the MAP objective:

1. $\log P(\theta) \propto -\|\theta\|_0$, where $\|\cdot\|_0$ is the pseudo-norm. Two greedy algorithms for this case are forward selection and backward selection.
2. $\log P(\theta) \propto -\|\theta\|_1$, where $\|\cdot\|_1$ is the L1 norm.

### Bias and Variance

Error can be written as a function of bias and variance:

$$\text{Error} = g(\text{bias}, \text{variance})$$

Usually there is a tradeoff: the bias goes up while the variance goes down, and vice versa. The absolute values of bias and variance are usually not what matters; what matters is how much they change relative to each other.

**Example.** Assume that the error function is the squared loss:

$$E\left[(y - f(x))^2\right] = \text{bias}^2 + \text{variance}$$

For one algorithm the bias is 4 and the variance is 3; for another algorithm the bias is 3 and the variance is 5. In the former case $E = 4^2 + 3 = 19$, while in the latter case $E = 3^2 + 5 = 14$. Even though the variance is smaller in the first case, its error is larger.

## Decision Trees (CART)

Figure 1: Example 1

Suppose a 2D coordinate space is divided into boxes by orthogonal lines parallel to the $x_1$ and $x_2$ axes. There might be several data points in each box, and the label of the data points is
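To make the MAP recap concrete, here is a minimal sketch of maximizing the unnormalized log posterior $\log P(D_n \mid \theta) + \log P(\theta)$ by grid search. The specific model, a Bernoulli likelihood with a Beta(2, 2) prior and the coin-flip counts, is an illustrative assumption, not from the lecture:

```python
import math

def log_posterior_unnorm(theta, heads, tails, a, b):
    # log P(D|theta) + log P(theta), dropping theta-independent constants:
    # Bernoulli likelihood, Beta(a, b) prior.
    return ((heads + a - 1) * math.log(theta)
            + (tails + b - 1) * math.log(1 - theta))

def map_estimate(heads, tails, a=2, b=2, grid=10_000):
    # argmax over a grid of theta values in (0, 1)
    best = max((log_posterior_unnorm(i / grid, heads, tails, a, b), i / grid)
               for i in range(1, grid))
    return best[1]

theta_hat = map_estimate(heads=7, tails=3)
```

For this conjugate pair the posterior is Beta(heads + a, tails + b), whose mode $(heads + a - 1)/(heads + tails + a + b - 2) = 8/12 \approx 0.667$ the grid search recovers, which is a quick sanity check that omitting the denominator does not change the argmax.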
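The forward-selection algorithm mentioned for the $\|\theta\|_0$ pseudo-norm case can be sketched as a greedy loop that repeatedly adds the single feature giving the largest drop in residual sum of squares. The linear-regression setting and the synthetic data below are assumptions for illustration:

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy forward selection: repeatedly add the feature that most
    reduces the residual sum of squares, until k features are chosen."""
    n, d = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# Only features 1 and 4 are truly active; the rest are noise.
y = 3 * X[:, 1] - 2 * X[:, 4] + 0.1 * rng.standard_normal(100)
chosen = forward_selection(X, y, 2)
```

Backward selection is the mirror image: start from all features and greedily drop the one whose removal increases the residual least.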
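Plugging the example's two algorithms into the squared-loss decomposition $E[(y - f(x))^2] = \text{bias}^2 + \text{variance}$:

```python
def squared_error(bias, variance):
    # Squared-loss decomposition: error = bias^2 + variance
    # (irreducible noise omitted, as in the lecture's formula).
    return bias ** 2 + variance

e1 = squared_error(4, 3)  # algorithm 1: bias 4, variance 3
e2 = squared_error(3, 5)  # algorithm 2: bias 3, variance 5
print(e1, e2)  # 19 14
```

The smaller-variance algorithm loses here because bias enters the error squared, which is the point of the example.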
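A single CART-style split of the kind Figure 1 describes, an axis-parallel cut of the 2D plane chosen to make the resulting boxes as label-pure as possible, can be sketched as follows. The toy points, labels, and majority-vote impurity measure are assumptions for illustration; CART recurses such splits to build the full tree of boxes:

```python
from collections import Counter

def best_axis_split(points, labels):
    """Try every axis-aligned threshold on a 2D labeled point set and
    return the (axis, threshold) pair minimizing the total number of
    points misclassified by a majority vote on each side."""
    def errors(side_labels):
        # misclassifications under majority vote in one box
        if not side_labels:
            return 0
        counts = Counter(side_labels)
        return len(side_labels) - max(counts.values())

    best = None
    for axis in (0, 1):
        for p in points:
            t = p[axis]
            left = [l for q, l in zip(points, labels) if q[axis] <= t]
            right = [l for q, l in zip(points, labels) if q[axis] > t]
            err = errors(left) + errors(right)
            if best is None or err < best[0]:
                best = (err, axis, t)
    return best[1], best[2]

pts = [(0.1, 0.2), (0.2, 0.8), (0.8, 0.3), (0.9, 0.9)]
lbl = ["A", "A", "B", "B"]
axis, t = best_axis_split(pts, lbl)  # splits the A's from the B's on x1
```

On this toy data the best cut is on the first coordinate at 0.2, which separates the two labels perfectly.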
