# ILLINOIS CS 446 - 092817.1 (7 pages)

Previewing pages*1, 2*of 7 page document

**View the full content.**## 092817.1

Previewing pages
*1, 2*
of
actual document.

**View the full content.**View Full Document

## 092817.1

0 0 53 views

- Pages:
- 7
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 10 Boosting Bagging and Surrogate Losses Lecturer Sanmi Koyejo Scribe Triveni Putti Sep 28 2017 1 Review of previous lecture Erratum Error of a random classifier h is given as follows n err h 1 1X I yi h xi n i 1 where h xi 1 1 with probability 0 5 otherwise The expected value of the error of this classifier is 0 5 i e E error h 0 5 It was erroneously stated that error h 0 5 Generally we split the data into two parts one for training and one for testing But when we want to select hyper parameters for any machine learning model such as for pruning in case of decision trees and for regularization parameter in case of linear regression we split the training set further into actual training and validation sets Validation set is used to set the hyper parameters Hence overall the data is now split into 3 parts training validation and test sets The way of selecting these three datasets should be random to avoid any innate correlation in the order in which data is stored Generally we cannot use test set instead of validation set to select the hyper parameters because then we will be overconfident about the generalization error and if there is a new test data then the parameters should be re tuned Same approach is valid for Cross validation as well In this case on a high level training and testing happens but there will be another Cross validation loop within training for finding the best hyper parameters Hence it is also called double Cross validation or inner loop Cross validation 1 2 10 Boosting Bagging and Surrogate Losses 2 Boosting The goal of boosting is to learn a classifier which is of the form f x K X m m x m 1 where m is the weight and m is a weak classifier Consider a 1 D data with distribution as shown below It is hard to classify using a linear classifier in this case because there is no linear classifier that will give the perfect classification So let us try to implement boosting algorithm on this data For the first iteration the weight Wn 1 will be equal to n1 which is same for all data points hence it is shown as a straight line in Figure 1 1 x is the result of the first iteration Now for step 2 the weights are redistributed The weight Wn 2 becomes higher wherever Figure 1 Step 1 of boosting The weights function is constant for all datapoints there was a mistake in classifying and lower wherever the classification was correct in Step 1 Hence the weights are increased for the two wrongly classified positives in the middle The classifier is more concerned about making good predictions in the wrongly classified areas as compared to others So 2 x will now predict the wrongly classified positives correctly but makes a mistake in the right most negative predictions On combining 1 x and 2 x based on their corresponding weights we get f2 X 1 1 x 2 2 x Its predictions are shown in Figure 2 For step 3 the weight Wn 3 will be increased for the wrong predictions right negatives After few iterations the final classifier will be more complex and will be able to identify correctly the boundary between positives and negatives in the middle The final classifier is fm x K X m x m x m 1 Few points to note The bias of the final classifier is very low as compared to the bias of the individual classifiers 10 Boosting Bagging and Surrogate Losses 3 Figure 2 Further steps of boosting Although conceptually there is a possibility to overfit practically it does not because each classifier is weak and therefore the additional change to the model in every step is fairly small Also since this classifier maximizes the margin it is unlikely to overfit In order to decide how many iterations to perform validation set should be used P We are essentially trying to minimize the weighted loss function Lm h n1 ni 1 wi m l yi xi that is fit in every mth step where the weights wi m are re weighted Note that the weights are always positive and often normalized Can I run boosting algorithm on a classifier which cannot add weights to the loss function P resulting in a loss function like Ln h n1 ni 1 l yi xi If so how Answer Yes it is possible We need to resample the dataset sample with replacement with the probability of selecting each example i proportional to its re normalized weight wi Exercise Let us say there are n data points x1 y1 x2 y2 xn yn with probabilities P1 P2 Pn 4 10 Boosting Bagging and Surrogate Losses respectively How do you sample n points with replacement from the above dataset using their probabilities Multiclass boosting using Adaboost The algorithm used here is Stagewise additive modeling using multiclass exponential loss SAMME and we follow the same steps as that of binary boosting with only one change Following are the steps for a dataset with classes that belong to 1 2 c Step1 Start with choosing wi n1 Step2 For m 1 2 K Fit a weak classifier m x to the training set using weights wm Pn i 1 Compute errm w I yi 6 m xi Pi m n i 1 wi m m Compute m log 1 err errm log c 1 Adding a bias term here is the only change as compared to binary boosting wi wi exp m I yi 6 m xi Normalize wi Step 3 Return f x PK m 1 m m x Boosting is an ensemble learner Ensemble means combination of multiple sub learners Boosting is called ensemble learner because it uses a group of weak learners that are combined using weights 3 Bagging Bagging is also a kind of ensemble method It is an acronym for Bootstrap aggregating Bootstrap Sampling This technique is used when we are trying to understand variation in a procedure by re sampling It is a trick to emulate random variation in the data Given a data set Dn xi yi ni 1 a bootstrap sample can be created by sampling with replacements uniformly from Dn Now let us say we created k such bootstrap samples of Dn which are called D 1 D 2 D k Following are the steps for Bagging Step 1 For m 1 2 k Create a bootstrap sample D m from Dn Learn a model m based on D m 10 Boosting Bagging and Surrogate Losses 5 Step 2 Compute the final function which is an average of each of the classifiers f x k X m x m 1 Jensen s inequality It states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation f tx1 1 t x2 tf x1 1 t f x2 Figure 3 shows the graphical representation of Jensen s inequality In the context …

View Full Document