Machine Learning 10-701/15-781, Spring 2008
Boosting from Weak Learners
Eric Xing
Lecture 9, February 13, 2008
Reading: Chap. 14.3, C.B. book

Rationale: Combination of methods
- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some early algorithms
- Boosting by filtering (Schapire, 1990)
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge of the performance of the weak learner.
- Boosting by majority (Freund, 1995)
  - Run the weak learner on a weighted example set.
  - Combine the weak hypotheses linearly.
  - Requires knowledge of the performance of the weak learner.
- Bagging (Breiman, 1996)
  - Run the weak learner on bootstrap replicates of the training set.
  - Average the weak hypotheses.
  - Reduces variance.

Combination of classifiers
- Suppose we have a family of component classifiers generating $\pm 1$ labels, such as decision stumps
  $h(x;\theta) = \mathrm{sign}(w x_k + b)$, where $\theta = \{k, w, b\}$.
- Each decision stump pays attention to only a single component of the input vector.

Combination of classifiers (cont'd)
- We'd like to combine the simple classifiers additively, so that the final classifier is the sign of
  $\hat h(x) = \alpha_1 h(x;\theta_1) + \dots + \alpha_m h(x;\theta_m)$,
  where the "votes" $\alpha_i$ emphasize component classifiers that make more reliable predictions than others.
- Important issues:
  - What is the criterion that we are optimizing? (measure of loss)
  - We would like to estimate each new component classifier in the same manner (modularity).

Measurement of error
- Loss function, e.g. $\ell(y, h(x)) = I(y \neq h(x))$.
- Generalization error: $L(h) = E[\ell(y, h(x))]$.
- Objective: find $h$ with minimum generalization error.
- Main boosting idea: minimize the empirical error
  $\hat L(h) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, h(x_n))$.

Exponential Loss
- One possible measure of empirical loss is
  $\sum_{i=1}^{n} \exp(-y_i \hat h_m(x_i))
   = \sum_{i=1}^{n} \exp\big(-y_i \hat h_{m-1}(x_i) - y_i \alpha_m h(x_i;\theta_m)\big)
   = \sum_{i=1}^{n} \exp(-y_i \hat h_{m-1}(x_i)) \exp(-y_i \alpha_m h(x_i;\theta_m))
   = \sum_{i=1}^{n} W_i^{m-1} \exp(-y_i \alpha_m h(x_i;\theta_m))$.
- The combined classifier based on $m-1$ iterations defines a weighted loss criterion for the next simple classifier to add.
- Each training sample is weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.

Linearization of loss function
- We can simplify the estimation criterion for the new component classifier a bit (assuming $\alpha$ is small):
  $\exp(-y_i \alpha_m h(x_i;\theta_m)) \approx 1 - y_i \alpha_m h(x_i;\theta_m)$.
- Now our empirical loss criterion reduces to
  $\sum_{i=1}^{n} \exp(-y_i \hat h_m(x_i))
   \approx \sum_{i=1}^{n} W_i^{m-1} \big(1 - y_i \alpha_m h(x_i;\theta_m)\big)
   = \sum_{i=1}^{n} W_i^{m-1} - \alpha_m \sum_{i=1}^{n} W_i^{m-1} y_i h(x_i;\theta_m)$.
- We could choose a new component classifier to optimize this weighted agreement.

A possible algorithm
- At stage $m$ we find $\theta_m$ that maximizes (or at least gives a sufficiently high) weighted agreement
  $\sum_{i=1}^{n} W_i^{m-1} y_i h(x_i;\theta_m)$
  (a code sketch of this step appears below):
  - each sample is weighted by its "difficulty" under the previously combined $m-1$ classifiers;
  - more difficult samples receive heavier attention, as they dominate the total loss.
- Then we go back and find the "votes" $\alpha_m$ associated with the new classifier by minimizing the original weighted (exponential) loss
  $\sum_{i=1}^{n} W_i^{m-1} \exp(-y_i \alpha_m h(x_i;\theta_m))$.

Boosting
- We have basically derived a Boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples:
  - each component classifier is presented with a slightly different problem.
- AdaBoost preliminaries:
  - we work with normalized weights $W_i$ on the training examples, initially uniform ($W_i = 1/n$);
  - the weights reflect the degree of difficulty of each datum on the latest classifier.
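As a concrete illustration of the stump family $h(x) = \mathrm{sign}(w x_k + b)$ and of picking a component classifier by maximizing the weighted agreement $\sum_i W_i y_i h(x_i)$, here is a minimal Python/numpy sketch. It is not from the lecture; the names fit_stump and stump_predict are invented for illustration, and the brute-force threshold search is just one simple way to realize the "A possible algorithm" step.

```python
import numpy as np

def fit_stump(X, y, W):
    """Fit a decision stump h(x) = sign(w * x_k + b) by maximizing the
    weighted agreement sum_i W_i * y_i * h(x_i) over a feature index k,
    an orientation w in {+1, -1}, and a threshold between data values."""
    n, d = X.shape
    best_agreement, best_params = -np.inf, (0, 1.0, 0.0)
    for k in range(d):
        vals = np.unique(X[:, k])
        # candidate thresholds: midpoints between consecutive feature values
        thresholds = (vals[:-1] + vals[1:]) / 2.0 if len(vals) > 1 else vals
        for t in thresholds:
            for w in (1.0, -1.0):
                pred = np.where(w * (X[:, k] - t) >= 0, 1.0, -1.0)
                agreement = np.sum(W * y * pred)
                if agreement > best_agreement:
                    best_agreement = agreement
                    best_params = (k, w, -w * t)   # b = -w * t
    return best_params                              # theta = (k, w, b)

def stump_predict(X, params):
    k, w, b = params
    return np.where(w * X[:, k] + b >= 0, 1.0, -1.0)
```

Each candidate split only looks at a single input component, which is exactly what makes the stump a weak (but cheap) learner.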
The AdaBoost algorithm
- At the k-th iteration we find (any) classifier $h(x;\theta_k)$ for which the weighted classification error
  $\epsilon_k = \frac{1}{2} - \frac{1}{2} \sum_{i=1}^{n} W_i^{k-1} y_i h(x_i;\theta_k)$
  is better than chance.
  - This is meant to be "easy": it is a weak classifier.
- Determine how many "votes" to assign to the new component classifier:
  $\alpha_k = 0.5 \log\big((1 - \epsilon_k)/\epsilon_k\big)$.
  - A stronger classifier gets more votes.
- Update the weights on the training examples:
  $W_i^{k} \propto W_i^{k-1} \exp(-y_i \alpha_k h(x_i;\theta_k))$.

The AdaBoost algorithm (cont'd)
- The final classifier after $m$ boosting iterations is given by the sign of
  $\hat h(x) = \dfrac{\alpha_1 h(x;\theta_1) + \dots + \alpha_m h(x;\theta_m)}{\alpha_1 + \dots + \alpha_m}$.
- The votes here are normalized for convenience.

AdaBoost summary (a code sketch of this loop appears at the end of the section)
- Input:
  - $N$ examples $S_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$
  - a weak base learner $h = h(x;\theta)$
- Initialize: equal example weights $w_i = 1/N$ for all $i = 1, \dots, N$.
- Iterate for $t = 1, \dots, T$:
  1. train the base learner on the weighted example set $(w_t, x)$ and obtain hypothesis $h_t = h(x;\theta_t)$
  2. compute the hypothesis error $\epsilon_t$
  3. compute the hypothesis weight $\alpha_t$
  4. update the example weights $w_{t+1}$ for the next iteration
- Output: the final hypothesis is a linear combination of the $h_t$.

AdaBoost dataflow diagram
[Figure: at each round $t = 1, \dots, T$, the weighted sample $(S, w_t)$ is fed to the base learner A, producing the next hypothesis and the next weights.]

Boosting examples (cont'd over several slides)
[Figures: successive boosting rounds on a toy data set.]

Base Learners
- Weak learners used in practice:
  - decision stumps (axis-parallel splits)
  - decision trees (e.g. C4.5 by Quinlan, 1996)
  - multi-layer neural networks
  - radial basis function networks
- Can base learners operate on weighted examples?
  - In many cases they can be modified to accept weights along with the examples.
  - In general, we can sample the examples (with replacement) according to the distribution defined by the weights.

Boosting performance
- The error rate of the component classifier (the decision stump) does not improve much, if at all, over time.
- But both the training and the testing error of the combined classifier improve over time.
- Even after the training error of the combined classifier goes to zero, boosting iterations can still improve the generalization error.

Why it is working
- You will need some learning theory (to be covered in the next two lectures) to understand this fully, but for now let's just go over some high-level ideas.
- Generalization error: with high probability, the generalization error is less than the training error plus a complexity term that grows with the number of rounds. As $T$ goes up, our bound becomes worse, so boosting should overfit!

Experiments
[Figure: test error and training error curves, from "The Boosting Approach to Machine Learning" by Robert E. Schapire.]

Training Margins
- When a vote is taken, the more predictors agreeing, the more confident you are in your prediction.
- Margin for example $(x_i, y_i)$:
  $\mathrm{margin}(x_i, y_i) = y_i \hat h(x_i) = y_i \dfrac{\alpha_1 h(x_i;\theta_1) + \dots + \alpha_m h(x_i;\theta_m)}{\alpha_1 + \dots + \alpha_m}$ ...
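The AdaBoost update rules above translate almost line for line into code. The following is a minimal Python/numpy sketch, not the course's reference implementation; fit_weak and predict_weak are placeholders for any weak learner that accepts example weights (for instance the stump sketch earlier), and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, fit_weak, predict_weak, T=50):
    """Sequentially add weak classifiers, reweighting the examples each round.
    fit_weak(X, y, W) returns weak-learner parameters; predict_weak(X, params)
    returns labels in {-1, +1}.  Returns a list of (alpha_t, params_t)."""
    n = X.shape[0]
    W = np.full(n, 1.0 / n)                       # initially uniform weights
    ensemble = []
    for _ in range(T):
        params = fit_weak(X, y, W)
        pred = predict_weak(X, params)
        # weighted error: eps = 0.5 - 0.5 * sum_i W_i y_i h(x_i)
        eps = 0.5 - 0.5 * np.sum(W * y * pred)
        if eps >= 0.5:                            # no better than chance: stop
            break
        eps = max(eps, 1e-12)                     # guard the log below
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # votes for this classifier
        ensemble.append((alpha, params))
        # reweight: W_i proportional to W_i * exp(-y_i alpha h(x_i)), renormalized
        W = W * np.exp(-y * alpha * pred)
        W = W / W.sum()
    return ensemble

def combined_score(X, ensemble, predict_weak):
    """Normalized weighted vote: its sign is the final classifier, and
    y * score gives the per-example margins."""
    total = sum(alpha for alpha, _ in ensemble)
    score = sum(alpha * predict_weak(X, params) for alpha, params in ensemble)
    return score / total
```

As a usage illustration (again under the naming assumptions above), one would call ensemble = adaboost(X, y, fit_stump, stump_predict, T=100), classify with np.sign(combined_score(X, ensemble, stump_predict)), and read off the training margins as y * combined_score(X, ensemble, stump_predict), matching the definition on the last slide.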