# ILLINOIS CS 446 - 092117.3 (7 pages)

Previewing pages 1, 2 of 7 page document
View Full Document

## 092117.3

Previewing pages 1, 2 of actual document.

View Full Document
View Full Document

## 092117.3

54 views

Pages:
7
School:
University of Illinois - urbana
Course:
Cs 446 - Machine Learning
##### Machine Learning Documents
Unformatted text preview:

CS446 Machine Learning Fall 2017 Lecture 8 Boosting Lecturer Sanmi Koyejo Scribe Yuyang Rao Sep 21th 2017 1 Recap 1 1 Map Maximum a posteriori MAP is an estimate of an unknown quantity that equals the mode of the posterior distribution It is one of the simplest case of regularization The model is map argmaxP Dn argmax P Dn P P Dn argmaxP Dn P argmax logP Dn logP 1 2 Variable Selection Forward and backward selection log P 0 LASSO Least absolute shrinkage and selection log P 1 Error function Error g bias variance Example if error function is defined as Error y f x 2 bias2 variance Algorithm 1 Bias 4 V ariance 2 then Error 42 2 18 Algorithm 2 Bias 3 V ariance 5 then Error 32 5 14 1 2 8 Boosting 2 Decision Tree CART 2 1 CART Classification and regression trees or CART models also called decision trees not to be confused with the decision trees used in decision theory are defined by recursively partitioning the input space and defining a local model in each resulting region of input space This can be represented by a tree with one leaf per region as we explain below Kevin 2012 Could be difficult to label boundary examples decide on training data adapt boxes to data Box may contain a mixture of data eg positive and negative data If no data in the box may not make decision Training cost depends on number of tests or number of splits Prediction speed is O depth of tree Momory to the model size of the tree number of tests Why do we care Worst case is that model size training data In general model size training data Bias and Variance In general bias of a decision tree is higher than Nearest neighbor classifiers NN and the variance of a decision tree is lower than NN Inductive Bias Data is axis separable Label partitioned in regions Scalable Robust to outlier Easy to mix discrete and continuous features 8 Boosting 3 2 2 Growing a Tree Finding the optimal partitioning of the data is NP complete Hyafil and Rivest 1976 so it is common to use the greedy procedure shown in Algorithm 6 to compute a locally optimal MLE This method is used by CART Breiman et al 1984 C4 5 Quinlan 1993 and ID3 Quinlan 1986 which are three popular implementations of the method Kevin 2012 See dtfit for a simple Matlab implementation To grow a decision tree we need a Dimension that we are splitting b Thresholds and regions The split function chooses the best feature and the best value for that feature as follows j t arg min j 1 D min cost xi yi xij t cost xi yi xij t t Tj where the cost function for a given dataset will be defined below For notational simplicity we have assumed all inputs are real valued or ordinal so it makes sense to compare a feature xij to a numeric value t The set of possible thresholds Tj for feature j can be obtained by sorting the unique values of xij For example if feature 1 has the values 4 5 12 72 12 then we set T1 12 4 5 72 In the case of categorical inputs the most common approach is to consider splits of the form xij ck and xij 6 ck for each possible class label ck Although we could allow for multi way splits resulting in non binary trees this would result in data fragmentation meaning too little data might fall into each subtree resulting in overfitting Kevin 2012 2 3 Cost Functions In the classification setting there are several ways to measure the quality of a split First we fit a multinoulli model to the data in the leaf satisfying the testXj t by estimating the class conditional probabilities as follows n c 1X 1 yi c n i 1 There are several common error measures for evaluating a proposed partition a Misclassification rate We define the most probable class label as yc argmaxc c The corresponding error rate is then n 1X 1 yi 6 f xi 1 y n i 1 4 8 Boosting b Entropy or deviance H c X c log c c 1 Note that minimizing the entropy is equivalent to maximizing the information gain c InforGain inf oGain H P arent H split H Y H Y Xj t X X p y c logp y c p y c Xj t logp c Xj t d Gini index C X c 1 c 1 c X c c X c c2 1 X c2 c This is the expected error rate To see this note that c is the probability a random entry in the leaf belongs to class c and 1 c is the probability it would be misclassified Figure 1 Node impurity measures for binary classification The horizontal axis corresponds to p the probability of class 1 The entropy measure has been rescaled to pass through 0 5 0 5 Based on Figure 9 3 of Hastie et al 2009 Figure generated by giniDemo Kevin 2012 In the figure above we see that the cross entropy and Gini measures are very similar and are more sensitive to changes in class probability than is the misclassification rate 8 Boosting 5 3 Boosting Binary Classifier Boosting Schapire and Freund 2012 is a greedy algorithm for fitting adaptive basis function models The algorithm works by applying the weak learner sequentially to weighted versions of the data where more weight is given to examples that were misclassified by earlier rounds Weak Classifier n ErrorRate f 1X 1 y f x n n 1 suppose random classifier f x 1 w p 0 5 1 w p 0 5 ErrorRate random classif ier 0 5 E f EX EY X 1 y 1 1 f x 1 1 y 1 1 f x 1 where EY X 1 y 1 X y1 y 1 P y x P Y 1 X In general for weak classifier error rate is less than 0 5 And it is slightly better than random classifier Step wise Suppose we have m different classifiers f1 f2 fm Step 1 Learn f1 Step 2 Learn f2 f1 f 12 f1 f2 Step 3 Learnf3 f2 f1 f 31 f1 f2 f3 Step m Learn fm fm 1 fm 2 f1 f 1 m Pm m 1 fm P Prediction sign m m 1 fm where fm m m X This shows that error rate is better than random classifier Boosting can achieve ANY accuracy depending on number of weak classifiers Boosting was originally derived in the computational learning theory literature Schapire 1990 Freund and Schapire 1996 where the focus is binary classification In these papers it 6 8 Boosting was proved that one could boost the performance on the training set of any weak learner arbitrarily high provided the weak learner could always perform slightly better than chance For example in Figure 2 we plot the training and test error for boosted decision stumps on a 2d dataset We see that the …

View Full Document

## Access the best Study Guides, Lecture Notes and Practice Exams Unlocking...