**Unformatted text preview:**

Ensemble Methods: Boosting Last TimeBoostingBoostingPAC LearningBoostingBoosting: Graphical IllustrationAdaBoostAdaBoostAdaBoostAdaBoostAdaBoostAdaBoostExampleFinal HypothesisBoostingMargins & BoostingMargins & BoostingMargins & BoostingBoosting PerformanceBoosting as OptimizationCoordinate DescentCoordinate DescentAdaBoost as OptimizationBoosting in PracticeBoosting vs. BaggingBoosting Beyond Binary ClassificationEnsemble Methods: BoostingNicholas RuozziUniversity of Texas at DallasBased on the slides of Vibhav Gogate and Rob SchapireLast Time• Variance reduction via bagging• Generate “new” training data sets by sampling with replacement from the empirical distribution• Learn a classifier for each of the newly sampled sets• Combine the classifiers for prediction• Today: how to reduce bias for binary classification problems2Boosting• How to translate rules of thumb (i.e., good heuristics) into good learning algorithms• For example, if we are trying to classify email as spam or not spam, a good rule of thumb may be that emails containing “Nigerian prince” or “business opportunity” are likely to be spam most of the time34Boosting• Freund & Schapire• Theory for “weak learners” in late 80’s• Weak Learner: performance on any training set is slightly better than chance prediction• Intended to answer a theoretical question, not as a practical way to improve learning• Tested in mid 90’s using not-so-weak learners• Works anyway!PAC Learning• Given i.i.d samples from an unknown, arbitrary distribution• “Strong” PAC learning algorithm• For any distribution with high probability given polynomially many samples (and polynomial time) can find classifier with arbitrarily small error • “Weak” PAC learning algorithm• Same, but error only needs to be slightly better than random guessing (e.g., accuracy only needs to exceed 50% for binary classification)• Does weak learnability imply strong learnability?56Boosting1. Weight all training samples equally2. Train model on training set3. Compute error of model on training set4. Increase weights on training cases model gets wrong5. Train new model on re-weighted training set6. Re-compute errors on weighted training set7. Increase weights again on cases model gets wrongRepeat until tiredFinal model: weighted prediction of each modelBoosting: Graphical Illustrationℎ1𝑥𝑥 ℎ2𝑥𝑥 ℎ𝑀𝑀(𝑥𝑥)ℎ 𝑥𝑥 = sign �𝑚𝑚𝛼𝛼𝑚𝑚ℎ𝑚𝑚(𝑥𝑥)7AdaBoost1. Initialize the data weights for the first round as 𝑤𝑤11, … , 𝑤𝑤1𝑀𝑀=1𝑀𝑀2. For 𝑡𝑡 = 1, … , 𝑇𝑇a) Select a classifier ℎ𝑡𝑡for the 𝑇𝑇𝑡𝑡𝑡round by minimizing the weighted error𝜖𝜖𝑡𝑡= �𝑚𝑚𝑤𝑤𝑡𝑡(𝑚𝑚)1𝑡𝑡𝑡𝑥𝑥𝑚𝑚≠𝑦𝑦(𝑚𝑚)b) Compute 𝛼𝛼𝑡𝑡=12ln1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡c) Update the weights𝑤𝑤𝑡𝑡+1𝑚𝑚=𝑤𝑤𝑡𝑡𝑚𝑚exp −𝑦𝑦(𝑚𝑚)ℎ𝑡𝑡𝑥𝑥(𝑚𝑚)𝛼𝛼𝑡𝑡2 𝜖𝜖𝑡𝑡⋅ 1 −𝜖𝜖𝑡𝑡8AdaBoost1. Initialize the data weights for the first round as 𝑤𝑤11, … , 𝑤𝑤1𝑀𝑀=1𝑀𝑀2. For 𝑡𝑡 = 1, … , 𝑇𝑇a) Select a classifier ℎ𝑡𝑡for the 𝑇𝑇𝑡𝑡𝑡round by minimizing the weighted error𝜖𝜖𝑡𝑡= �𝑚𝑚𝑤𝑤𝑡𝑡(𝑚𝑚)1𝑡𝑡𝑡𝑥𝑥𝑚𝑚≠𝑦𝑦(𝑚𝑚)b) Compute 𝛼𝛼𝑡𝑡=12ln1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡c) Update the weights𝑤𝑤𝑡𝑡+1𝑚𝑚=𝑤𝑤𝑡𝑡𝑚𝑚exp −𝑦𝑦(𝑚𝑚)ℎ𝑡𝑡𝑥𝑥(𝑚𝑚)𝛼𝛼𝑡𝑡2 𝜖𝜖𝑡𝑡⋅ 1 −𝜖𝜖𝑡𝑡Weighted number of incorrect classifications of the 𝑡𝑡𝑡𝑡𝑡classifier 9AdaBoost1. Initialize the data weights for the first round as 𝑤𝑤11, … , 𝑤𝑤1𝑀𝑀=1𝑀𝑀2. For 𝑡𝑡 = 1, … , 𝑇𝑇a) Select a classifier ℎ𝑡𝑡for the 𝑇𝑇𝑡𝑡𝑡round by minimizing the weighted error𝜖𝜖𝑡𝑡= �𝑚𝑚𝑤𝑤𝑡𝑡(𝑚𝑚)1𝑡𝑡𝑡𝑥𝑥𝑚𝑚≠𝑦𝑦(𝑚𝑚)b) Compute 𝛼𝛼𝑡𝑡=12ln1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡c) Update the weights𝑤𝑤𝑡𝑡+1𝑚𝑚=𝑤𝑤𝑡𝑡𝑚𝑚exp −𝑦𝑦(𝑚𝑚)ℎ𝑡𝑡𝑥𝑥(𝑚𝑚)𝛼𝛼𝑡𝑡2 𝜖𝜖𝑡𝑡⋅ 1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡→ 0𝛼𝛼𝑡𝑡→ ∞10AdaBoost1. Initialize the data weights for the first round as 𝑤𝑤11, … , 𝑤𝑤1𝑀𝑀=1𝑀𝑀2. For 𝑡𝑡 = 1, … , 𝑇𝑇a) Select a classifier ℎ𝑡𝑡for the 𝑇𝑇𝑡𝑡𝑡round by minimizing the weighted error𝜖𝜖𝑡𝑡= �𝑚𝑚𝑤𝑤𝑡𝑡(𝑚𝑚)1𝑡𝑡𝑡𝑥𝑥𝑚𝑚≠𝑦𝑦(𝑚𝑚)b) Compute 𝛼𝛼𝑡𝑡=12ln1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡c) Update the weights𝑤𝑤𝑡𝑡+1𝑚𝑚=𝑤𝑤𝑡𝑡𝑚𝑚exp −𝑦𝑦(𝑚𝑚)ℎ𝑡𝑡𝑥𝑥(𝑚𝑚)𝛼𝛼𝑡𝑡2 𝜖𝜖𝑡𝑡⋅ 1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡→ .5𝛼𝛼𝑡𝑡→ 011AdaBoost1. Initialize the data weights for the first round as 𝑤𝑤11, … , 𝑤𝑤1𝑀𝑀=1𝑀𝑀2. For 𝑡𝑡 = 1, … , 𝑇𝑇a) Select a classifier ℎ𝑡𝑡for the 𝑇𝑇𝑡𝑡𝑡round by minimizing the weighted error𝜖𝜖𝑡𝑡= �𝑚𝑚𝑤𝑤𝑡𝑡(𝑚𝑚)1𝑡𝑡𝑡𝑥𝑥𝑚𝑚≠𝑦𝑦(𝑚𝑚)b) Compute 𝛼𝛼𝑡𝑡=12ln1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡c) Update the weights𝑤𝑤𝑡𝑡+1𝑚𝑚=𝑤𝑤𝑡𝑡𝑚𝑚exp −𝑦𝑦(𝑚𝑚)ℎ𝑡𝑡𝑥𝑥(𝑚𝑚)𝛼𝛼𝑡𝑡2 𝜖𝜖𝑡𝑡⋅ 1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡→ 1𝛼𝛼𝑡𝑡→ −∞12AdaBoost1. Initialize the data weights for the first round as 𝑤𝑤11, … , 𝑤𝑤1𝑀𝑀=1𝑀𝑀2. For 𝑡𝑡 = 1, … , 𝑇𝑇a) Select a classifier ℎ𝑡𝑡for the 𝑇𝑇𝑡𝑡𝑡round by minimizing the weighted error𝜖𝜖𝑡𝑡= �𝑚𝑚𝑤𝑤𝑡𝑡(𝑚𝑚)1𝑡𝑡𝑡𝑥𝑥𝑚𝑚≠𝑦𝑦(𝑚𝑚)b) Compute 𝛼𝛼𝑡𝑡=12ln1 −𝜖𝜖𝑡𝑡𝜖𝜖𝑡𝑡c) Update the weights𝑤𝑤𝑡𝑡+1𝑚𝑚=𝑤𝑤𝑡𝑡𝑚𝑚exp −𝑦𝑦(𝑚𝑚)ℎ𝑡𝑡𝑥𝑥(𝑚𝑚)𝛼𝛼𝑡𝑡2 𝜖𝜖𝑡𝑡⋅ 1 −𝜖𝜖𝑡𝑡13Normalization constantExample• Consider a classification problem where vertical and horizontal lines (and their corresponding half spaces) are the weak

View Full Document