ILLINOIS CS 446 - 092817.1 - D3425969

Home> Schools> University of Illinois - urbana> (CS) > CS 446> 092817.1

DOC PREVIEW

ILLINOIS CS 446 - 092817.1

School name University of Illinois - urbana

Course Cs 446- Machine Learning

Pages 7

This preview shows page 1-2 out of 7 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS446: Machine Learning, Fall 2017Lecture 10 : Boosting, Bagging and Surrogate LossesLecturer: Sanmi Koyejo Scribe: Triveni Putti, Sep. 28, 20171. Review of previous lecture• Erratum - Error of a random classifier(h) is given as follows:err(h) = 1 −1nnXi=1I[yi=h(xi)]where h(xi) =(1 with probability 0.5−1 otherwiseThe expected value of the error of this classifier is 0.5 i.e.E[error(h)] = 0.5. It waserroneously stated that error(h) = 0.5.• Generally we split the data into two parts: one for training and one for testingBut when we want to select hyper parameters for any machine learning model, such as forpruning in case of decision trees and for regularization parameter in case of linear regression,we split the training set further into actual training and validation sets. Validation setis used to set the hyper parameters. Hence, overall the data is now split into 3 parts:training, validation and test sets.The way of selecting these three datasets should be random to avoid any innate correlationin the order in which data is stored. Generally, we cannot use test set instead of validationset to select the hyper parameters because then, we will be overconfident about the gener-alization error and if there is a new test data, then the parameters should be re-tuned.Same approach is valid for Cross-validation as well. In this case, on a high level, trainingand testing happens but there will be another Cross-validation loop within training forfinding the best hyper parameters. Hence, it is also called ”double Cross-validation” or”inner loop Cross-validation”12 10 : Boosting, Bagging and Surrogate Losses2. BoostingThe goal of boosting is to learn a classifier which is of the form:f(x) =KXm=1αmφm(x)whereαmis the weight andφmis a weak classifier. Consider a 1-D data with distribution asshown below,It is hard to classify using a linear classifier in this case because there is no linear classifier thatwill give the perfect classification. So, let us try to implement boosting algorithm on this data.For the first iteration, the weight,Wn,1will be equal to1nwhich is same for all data points henceit is shown as a straight line in Figure 1. φ1(x) is the result of the first iteration.Now, for step 2, the weights are redistributed. The weight (Wn,2) becomes higher whereverFigure 1: Step 1 of boosting. The weights function is constant for all datapoints.there was a mistake in classifying and lower wherever the classification was correct in Step 1.Hence, the weights are increased for the two wrongly classified positives in the middle. Theclassifier is more concerned about making good predictions in the wrongly classified areas ascompared to others. So,φ2(x) will now predict the wrongly classified positives correctly (butmakes a mistake in the right most negative predictions). On combiningφ1(x) andφ2(x) basedon their corresponding weights, we getf2(X) =α1φ1(x) +α2φ2(x). Its predictions are shown inFigure 2.For step 3, the weight(Wn,3) will be increased for the wrong predictions (right negatives).After few iterations, the final classifier will be more complex and will be able to identify correctlythe boundary between positives and negatives in the middle. The final classifier isfm(x) =KXm=1αm(x)φm(x).Few points to note,•The bias of the final classifier is very low as compared to the bias of the individual classifiers.10 : Boosting, Bagging and Surrogate Losses 3Figure 2: Further steps of boosting•Although conceptually there is a possibility to overfit, practically it does not because eachclassifier is weak and therefore the additional change to the model in every step is fairlysmall. Also since this classifier maximizes the margin, it is unlikely to overfit.• In order to decide how many iterations to perform, validation set should be used.•We are essentially trying to minimize the weighted loss function,Lm(h) =1nPni=1wi,ml(yi, φ(xi))that is fit in everymthstep where the weightswi,mare re-weighted. Note that the weightsare always positive and often normalized.Can I run boosting algorithm on a classifier which cannot add weights to the loss functionresulting in a loss function like Ln(h) =1nPni=1l(yi, φ(xi))? If so, how ?Answer: Yes, it is possible. We need to resample the dataset (sample with replacement) withthe probability of selecting each example i proportional to its re-normalized weight wi.Exercise:Let us say there are n data points (x1, y1),(x2, y2), ....,(xn, yn) with probabilitiesP1, P2, ....Pn4 10 : Boosting, Bagging and Surrogate Lossesrespectively. How do you sample ’n’ points with replacement from the above dataset using theirprobabilities ?Multiclass boosting using AdaboostThe algorithm used here is Stagewise additive modeling using multiclass exponential loss(SAMME) and we follow the same steps as that of binary boosting with only one change.Following are the steps for a dataset with classes that belong to [1,2,...c]:Step1 : Start with choosing wi=1nStep2 : For m = 1,2,...K,• Fit a weak classifier φm(x)to the training set using weights wm• Compute errm=Pni=1wi,mI[yi6=φm(xi)]Pni=1wi,m•Computeαm=log[(1−errmerrm]) +log(c −1) (Adding a bias term here is the only change ascompared to binary boosting)• wi← wiexp[αmI[yi6=φm(xi)]]• Normalize wiStep 3 : Return f(x) =PKm=1αmφm(x)Boosting is an ensemble learner. Ensemble means combination of multiple sub-learners. Boostingis called ensemble learner because it uses a group of weak learners that are combined usingweights.3. BaggingBagging is also a kind of ensemble method. It is an acronym for Bootstrap aggregating.Bootstrap SamplingThis technique is used when we are trying to understand variation in a procedure by re-sampling.It is a trick to emulate random variation in the data.Given a data setDn={(xi, yi)}ni=1, a bootstrap sample can be created by sampling withreplacements uniformly fromDn. Now, let us say we created k such bootstrap samples ofDn,which are calledˆD(1),ˆD(2), ...ˆD(k). Following are the steps for Bagging:Step 1 : For m = 1,2...k– Create a bootstrap sampleˆD(m) from Dn– Learn a model φ(m) based onˆD(m)10 : Boosting, Bagging and Surrogate Losses 5Step 2 : Compute the final function,which is an average of each of the classifiers,f(x) =kXm=1φm(x)Jensen’s inequalityIt states that the convex transformation of a mean is less than or equal to the mean appliedafter convex transformation.f (tx1+ (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)Figure 3 shows the graphical

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 7 pages.

ILLINOIS CS 446 - 092817.1

Sign up for free to view:

Please select your school