# ILLINOIS CS 446 - 092117.3
**CS 446: Machine Learning, Fall 2017**

**Lecture 8: Boosting**

Lecturer: Sanmi Koyejo. Scribe: Yuyang Rao. September 21, 2017.

## 1 Recap

### 1.1 MAP

Maximum a posteriori (MAP) estimation gives an estimate of an unknown quantity that equals the mode of the posterior distribution. It is one of the simplest cases of regularization. The model is

$$
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D_n)
= \arg\max_{\theta} \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}
= \arg\max_{\theta} P(D_n \mid \theta)\, P(\theta)
= \arg\max_{\theta} \big[ \log P(D_n \mid \theta) + \log P(\theta) \big].
$$

### 1.2 Variable Selection

- Forward and backward selection: $\log P(\theta) \propto \|\theta\|_0$.
- LASSO (least absolute shrinkage and selection): $\log P(\theta) \propto \|\theta\|_1$.

**Error function.** $\mathrm{Error}(g) = \mathrm{bias} + \mathrm{variance}$. For example, if the error function is defined as $\mathrm{Error} = \mathbb{E}[(y - f(x))^2] = \mathrm{bias}^2 + \mathrm{variance}$, then:

- Algorithm 1: bias $= 4$, variance $= 2$, so $\mathrm{Error} = 4^2 + 2 = 18$.
- Algorithm 2: bias $= 3$, variance $= 5$, so $\mathrm{Error} = 3^2 + 5 = 14$.

Algorithm 2 achieves the lower error despite its higher variance.

## 2 Decision Trees (CART)

### 2.1 CART

Classification and regression trees, or CART models, also called decision trees (not to be confused with the decision trees used in decision theory), are defined by recursively partitioning the input space and defining a local model in each resulting region of input space. This can be represented by a tree with one leaf per region, as we explain below (Kevin, 2012).

- It can be difficult to label boundary examples; the splits are decided on the training data, so the boxes adapt to the data.
- A box may contain a mixture of data, e.g. both positive and negative examples.
- If there is no data in a box, the tree may not be able to make a decision there.
- Training cost depends on the number of tests (the number of splits).
- Prediction speed is $O(\text{depth of tree})$.
- Memory for the model is the size of the tree (the number of tests).

**Why do we care?** In the worst case the model size is comparable to the size of the training data; in general, model size $\ll$ training data.

**Bias and variance.** In general, the bias of a decision tree is higher than that of nearest-neighbor (NN) classifiers, and the variance of a decision tree is lower than that of NN.

**Inductive bias.** The data is axis-separable, and the labels are partitioned into regions.

Decision trees are scalable, robust to outliers, and make it easy to mix discrete and continuous features.
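To make the $O(\text{depth})$ prediction cost concrete, here is a minimal sketch of leaf lookup in a hand-built tree; the `Node` class and the example tree are illustrative, not part of the lecture.

```python
class Node:
    """A binary decision-tree node (illustrative): internal nodes test
    x[feature] <= threshold; leaves carry a label, the local model
    for their region of input space."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

def predict(node, x):
    # Follows a single root-to-leaf path, so the cost is O(depth of tree).
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

# A hypothetical depth-2 tree over two features; the memory the model
# needs is just this tree (one test per internal node).
root = Node(feature=0, threshold=0.5,
            left=Node(feature=1, threshold=0.5, left=Node(label=0), right=Node(label=1)),
            right=Node(label=1))
print(predict(root, [0.2, 0.9]))  # -> 1
```

Note that a point falling in a region with mixed or no training data simply receives that leaf's stored label, which is exactly the boundary/empty-box caveat in the list above.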
### 2.2 Growing a Tree

Finding the optimal partitioning of the data is NP-complete (Hyafil and Rivest, 1976), so it is common to use the greedy procedure shown in Algorithm 6 to compute a locally optimal MLE. This method is used by CART (Breiman et al., 1984), C4.5 (Quinlan, 1993), and ID3 (Quinlan, 1986), which are three popular implementations of the method (Kevin, 2012). See dtfit for a simple Matlab implementation.

To grow a decision tree we need (a) a dimension that we are splitting on, and (b) thresholds and regions. The split function chooses the best feature, and the best value for that feature, as follows:

$$
(j^*, t^*) = \arg\min_{j \in \{1, \dots, D\}} \; \min_{t \in \mathcal{T}_j} \; \mathrm{cost}\big(\{(x_i, y_i) : x_{ij} \le t\}\big) + \mathrm{cost}\big(\{(x_i, y_i) : x_{ij} > t\}\big)
$$

where the cost function for a given dataset will be defined below. For notational simplicity, we have assumed all inputs are real-valued or ordinal, so it makes sense to compare a feature to a threshold.
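The split search above can be sketched directly. This is a hedged Python/NumPy illustration, not the dtfit code the notes refer to: it uses Gini impurity, weighted by region size, as one common stand-in for the cost function (the lecture defines its own cost later), and `gini` / `best_split` are hypothetical names.

```python
import numpy as np

def gini(y):
    # Gini impurity of a set of integer class labels; 0 for a pure (or empty) set.
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for (j*, t*): try every feature j and every observed
    threshold t, keeping the split that minimizes
    cost({i : x_ij <= t}) + cost({i : x_ij > t})."""
    best_j, best_t, best_cost = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):          # T_j = values seen in the data
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Weight each region's impurity by its size (one standard convention).
            cost = len(left) * gini(left) + len(right) * gini(right)
            if cost < best_cost:
                best_j, best_t, best_cost = j, t, cost
    return best_j, best_t

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])
print(best_split(X, y))  # feature 0 at threshold 0.0 separates the labels perfectly
```

Scanning only the values that occur in the data is what keeps $\mathcal{T}_j$ finite; practical implementations additionally stop splitting when the cost reduction becomes negligible, which is how the greedy recursion terminates.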
