BOOSTING (ADABOOST ALGORITHM)
Eric Emer

Consider a Horse-Racing Gambler
• Rules of thumb for determining win/loss:
• Most favored odds
• Fastest recorded lap time
• Most wins recently, say, in the past month
• It is hard to determine how the gambler combines his analysis of the feature set into a single bet.

Consider MIT Admissions
• 2-class system (Admit/Deny)
• Both quantitative and qualitative data
• We consider (Y/N) answers to be quantitative (-1, +1)
• Region, for instance, is qualitative.

Rules of Thumb, Weak Classifiers
• It is easy to come up with rules of thumb that classify the training data better than chance.
• E.g., IF "GoodAtMath" == Y THEN predict "Admit".
• It is difficult to find a single, highly accurate prediction rule. This is where boosting, via the AdaBoost algorithm, helps us.

What is a Weak Learner?
• For any distribution, with high probability, given polynomially many examples and polynomial time, we can find a classifier whose generalization error is better than random guessing: $\epsilon < \frac{1}{2}$, also written as generalization error $\frac{1}{2} - \gamma$ for some $\gamma > 0$.

Weak Learning Assumption
• We assume that our weak learning algorithm (weak learner) can consistently find weak classifiers: rules of thumb that classify the data correctly more than 50% of the time.
• Given this assumption, we can use boosting to generate a single weighted classifier that correctly classifies our training data at 99%-100%.

AdaBoost Specifics
• How does AdaBoost weight training examples optimally?
• It focuses on the difficult data points: those misclassified most often by the previous weak classifiers.
• How does AdaBoost combine these weak classifiers into a comprehensive prediction?
• It uses an optimally weighted majority vote of the weak classifiers.

AdaBoost Technical Description
• Given $(x_1, y_1), \dots, (x_m, y_m)$ with $y_i \in \{-1, +1\}$: for $t = 1, \dots, T$, construct a distribution $D_t$ over the examples and find a weak classifier $h_t$ with small weighted error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
• Missing details: How do we generate the distribution? How do we get a single classifier?

Constructing $D_t$
• $D_1(i) = \frac{1}{m}$
• Given $D_t$ and $h_t$:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$$
where $Z_t$ is a normalization constant and $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} > 0$.

Getting a Single Classifier
$$H_{\text{final}}(x) = \operatorname{sign}\Big(\sum_t \alpha_t h_t(x)\Big)$$
(A Python sketch of the full procedure appears after the Pros/Cons slide below.)

Mini-Problem

Training Error Analysis
Claim: if each weak classifier has error $\epsilon_t = \frac{1}{2} - \gamma_t$, then
$$\text{training error}(H_{\text{final}}) \;\le\; \prod_t Z_t \;=\; \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\le\; e^{-2\sum_t \gamma_t^2}.$$

Proof
• Step 1: Unwrap the recurrence for $D_{t+1}$.
• Step 2: Show $\text{training error}(H_{\text{final}}) \le \prod_t Z_t$.
• Step 3: Show $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$.
(The three steps are worked out after the Pros/Cons slide below.)

How might test error react to AdaBoost?
We expect to encounter:
• Occam's Razor
• Overfitting

Empirical Results on Test Error
• Test error does not increase even after 1000 rounds.
• Test error continues to drop even after training error reaches zero.

Difference from Expectation: The Margins Explanation
• Our training error only measures the correctness of classifications; it neglects the confidence of classifications. How can we measure the confidence of classifications?
$$H_{\text{final}}(x) = \operatorname{sign}(f(x)), \qquad f(x) = \frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, +1], \qquad \operatorname{margin}(x, y) = y\, f(x)$$
• margin(x, y) close to +1: high confidence, correct.
• margin(x, y) close to -1: high confidence, incorrect.
• margin(x, y) close to 0: low confidence.
(A short code sketch for computing margins appears after the Pros/Cons slide below.)

Empirical Evidence Supporting the Margins Explanation
• Figure: cumulative distribution of margins on the training examples.

Pros/Cons of AdaBoost
Pros
• Fast
• Simple and easy to program
• No parameters to tune (except the number of rounds T)
• No prior knowledge needed about the weak learner
• Provably effective given the Weak Learning Assumption
• Versatile
Cons
• Weak classifiers that are too complex lead to overfitting.
• Weak classifiers that are too weak can lead to low margins, and can also lead to overfitting.
• From empirical evidence, AdaBoost is particularly vulnerable to uniform noise.
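To make the algorithm slides concrete, here is a minimal Python sketch of the loop described above: construct $D_t$, fit a weak classifier, compute $\alpha_t$, reweight, and take the sign of the weighted vote. The decision-stump weak learner and all function names are assumptions of this sketch; the slides do not fix a particular weak learner.

```python
import numpy as np

def fit_stump(X, y, D):
    """Weak learner (an assumption of this sketch): picks the single-feature
    threshold rule with the lowest weighted error under the distribution D."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= thresh, 1, -1)
                err = D[pred != y].sum()          # weighted error under D
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    return (lambda Z: sign * np.where(Z[:, j] >= thresh, 1, -1)), best_err

def adaboost(X, y, T=50):
    """Returns the weights alpha_t and weak classifiers h_t of H_final."""
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    alphas, hs = [], []
    for t in range(T):
        h, eps = fit_stump(X, y, D)               # weak classifier, error eps_t
        if eps >= 0.5:                            # weak learning assumption fails
            break
        eps = max(eps, 1e-12)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t
        D = D * np.exp(-alpha * y * h(X))         # D_t(i) * e^{-alpha_t y_i h_t(x_i)}
        D = D / D.sum()                           # ... / Z_t (normalize)
        alphas.append(alpha)
        hs.append(h)
    return alphas, hs

def H_final(alphas, hs, X):
    """Weighted majority vote: sign(sum_t alpha_t h_t(x))."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, hs)))
```

For the admissions example, each stump plays the role of one rule of thumb, e.g. a threshold on a ±1-coded "GoodAtMath" answer.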
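The three proof steps from the Training Error Analysis slide can be written out as follows. This is a reconstruction of the standard argument the slide sketches, using $f(x) = \sum_t \alpha_t h_t(x)$ from the "Getting a Single Classifier" slide.

```latex
% Step 1: unwrap the recurrence D_{t+1}(i) = D_t(i) e^{-alpha_t y_i h_t(x_i)} / Z_t:
\[
  D_{T+1}(i)
  = \frac{1}{m} \cdot \frac{\exp\!\big(-y_i \sum_t \alpha_t h_t(x_i)\big)}{\prod_t Z_t}
  = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t}.
\]
% Step 2: a mistake H_final(x_i) != y_i means y_i f(x_i) <= 0, so e^{-y_i f(x_i)} >= 1;
% the 0-1 training error is therefore bounded by the exponential loss:
\[
  \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{H_{\mathrm{final}}(x_i) \neq y_i\}
  \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)}
  = \sum_{i=1}^{m} D_{T+1}(i) \prod_t Z_t
  = \prod_t Z_t.
\]
% Step 3: evaluate the normalizer with alpha_t = (1/2) ln((1 - eps_t)/eps_t)
% and eps_t = 1/2 - gamma_t, using 1 - x <= e^{-x}:
\[
  Z_t = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}
      = 2\sqrt{\epsilon_t (1-\epsilon_t)}
      = \sqrt{1 - 4\gamma_t^2}
      \le e^{-2\gamma_t^2}.
\]
```

So the training error decreases exponentially in T whenever every $\gamma_t \ge \gamma > 0$, which is exactly the Weak Learning Assumption.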
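Finally, a short sketch of the margin computation from the margins slides, reusing the hypothetical adaboost output from the sketch above:

```python
def margins(alphas, hs, X, y):
    """margin(x, y) = y * f(x), with f the alpha-normalized vote,
    so every margin lies in [-1, +1]."""
    f = sum(a * h(X) for a, h in zip(alphas, hs)) / sum(alphas)
    return y * f
```

Sorting these values and plotting their empirical CDF gives the cumulative distribution of margins referenced on the evidence slide.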
Predicting College Football Results
• Training data: 2009 NCAAF season
• Test data: 2010 NCAAF season