Unformatted text preview:

Ensemble Methods for Machine Learning The Ensemble Strikes Back Outline Motivations and techniques Bias variance bagging Combining learners vs choosing between them bucket of models stacking blending Pac learning theory boosting Relation of boosting to other learning methods optimization SVMs Review Of Boosting Sample with replacement Increase weight of xi if ht is wrong decrease weight if ht is right Linear combination of base hypotheses best weight t depends on error of ht Boosting A toy example Thanks Rob Schapire Thanks Rob Schapire Boosting A toy example Thanks Rob Schapire Boosting A toy example Thanks Rob Schapire Boosting A toy example Thanks Rob Schapire Boosting A toy example Boosting improved decision trees 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS Analysis Of Boosting Theorem 1 error rate Theorem 1 error rate Proof sign f x where upper bound on error on i QED Theorem 1 So pick h s and s to minimize Z s Simplified notation drop the t s let ui yiht xi remember that ui 1 or 1 Claim 1 0 0 1 sign f x where ui 1 ui 1 equality for u 1 1 imequality holds for 1 u 1 So let s minimize f to pick a best Minimize f 1 a a a a f a D i e u e e u e i i 2 i 1 a a 1 a a D i ui e e D i e e i 2 2 i r a a 1 a a e e D i e e 2 2 i sign f x where 1 r 1 r a f a e 1 e 0 2 2 a r a a e a ea e e 2 2 D i r a a e a ea e e 2 2 1 r a 1 r e a e 2 2 i Theorem 1 So pick h s and s to minimize Z s Theorem 2 when for then Zt 1 rt2 and hence training error is bounded by Comment if h x 1 then Boosting as Optimization Even boosting single features worked well 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS Reuters newswire corpus Some background facts 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS Coordinate descent optimization to minimize f w For t 1 T or till convergence For i 1 N where w w1 wN Pick w to minimize f w1 wi 1 w wi 1 wN Set wi w Boosting as optimization using coordinate descent With a small number of possible h s you can think of boosting as finding a linear combination of these H x sign wi hi x i So boosting is sort of like stacking h x h1 x hN x w a1 a N stacked instance vector weight vector Boosting uses coordinate descent to minimize an upper bound on error rate exp yi wi hi x i t i Boosting and optimization 1950 T Jerome Friedman Trevor Hastie and Robert Tibshirani Additive logistic regression a statistical view of boosting The Annals of Statistics 2000 1984 V 1988 KV 1989 S 1993 DSS 1995 FS 1999 FHT Compared using AdaBoost to set feature weights vs direct optimization of feature weights to minimize loglikelihood squared error Boosting as Margin Learning Boosting didn t seem to overfit 1950 T test error 1984 V 1988 KV 1989 S 1993 DSS 1995 FS train error because it turned out to be increasing the margin of the classifier 1950 T 1000 rounds 1984 V 1988 KV 1989 S 1993 DSS 1995 FS 100 rounds Boosting movie Some background facts 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS Coordinate descent optimization to minimize f w For t 1 T or till convergence For i 1 N where w w1 wN Pick w to minimize f w1 wi 1 w wi 1 wN Set wi w w k k k k T w1 w w 2 w21 wT2 w 1 w1 wT w max w1 wT Boosting is closely related to margin classifiers like SVM voted perceptron 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS h x h1 x hT x stacked instance vector w a1 a N weight vector Boosting w h x y max w min x w 1 h x optimized by coordinate descent here w 1 a t and h x max t ht x t The coordinates are being extended by one in each round of boosting usually unless you happen to generate the same tree twice Boosting is closely related to margin classifiers like SVM voted perceptron 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS h x h1 x hT x stacked instance vector w a1 a N weight vector Boosting w h x y max w min x w 1 h x optimized by coordinate descent here w 1 a t and h x max t ht x t Linear SVMs w h x y max w min x w 2 h x 2 where w2 optimized by QP 2 a and h x h x t t 2 2 t Wrapup On Boosting Boosting in the real world 1950 T 1984 V 1988 KV 1989 S 1993 DSS 1995 FS now William s wrap up Boosting is not discussed much in the ML research community any more It s much too well understood It s really useful in practice as a meta learning method Eg boosted Na ve Bayes usually beats Na ve Bayes Boosted decision trees are almost always competitive with respect to accuracy very robust against rescaling numeric features extra features non linearities somewhat slower to learn and use than many linear classifiers But getting probabilities out of them is a little less reliable


View Full Document

CMU MLG 10601 - ensembles2

Loading Unlocking...
Login

Join to view ensembles2 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view ensembles2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?