Ensemble Methods for Machine Learning Outline Motivations and techniques Bias variance bagging Combining learners vs choosing between them bucket of models stacking blending Pac learning theory boosting Relation of boosting to other learning methods optimization SVMs Why Bias and Variance Motivate Ensembles Bias and variance For regression we can decompose the error of the learned regression into two terms bias and variance Bias the class of models can t fit the data Fix a more expressive model class Variance the class of models could fit the data but doesn t because it s hard to fit Fix a less expressive model class Bias Variance decomposition of error ED dataset and noise f x h D x true function 2 learned from D Fix test case x then do this experiment 1 Draw size n sample D x1 y1 xn yn 2 Train linear regressor hD using D 3 Draw the test example x f x 4 Measure squared error of hD on that example x 5 Bias Variance decomposition of error 2 E f y E h ED hD x y y D hD x f h h y 2 2 2 f h h y E f h 2 h y 2 2 E fh E fy E h2 E hy E f h 2 E h y Squared difference between best possible prediction for x f x and our long term expectation for what the learner will do if we averaged over many datasets D ED hD x 6 VARIANCE Squared difference btwn our longterm expectation for the learners performance ED hD x and what we expect in a representative run on a dataset D hat y BIAS2 Bias variance decomposition This is something real that you can approximately measure experimentally if you have synthetic data or a large dataset Different learners and model classes have different tradeoffs large bias small variance few features highly regularized highly pruned decision trees small bias high variance many features less regularization unpruned trees Bias and variance Domingos A Unified Bias Variance Decomposition and its Applications ICML 2000 For classification we can also decompose the error of a learned classifier into two terms bias and variance Bias the class of models can t fit the data Fix a more expressive model class Variance the class of models could fit the data but doesn t because it s hard to fit Fix a less expressive model class How else can we reduce variance Bias and variance Beside tradeoffs How can we reduce variance If we want to reduce standard error sd sqrt n of a sample what can we do Get more samples Here the experiment is a dataset D and a classifier hD Can we get more samples of the experiment Idea 1 randomize the algorithm somehow and run it many times Idea 2 randomize the dataset somehow Bootstrap sampling Input dataset D Output many variants of D D1 DT For t 1 T Dt For i 1 D Pick x y uniformly at random from D i e with replacement and add it to Dt Some examples never get picked 37 Some are picked 2x 3x Bootstrap Aggregation Bagging Use the bootstrap to create T variants of D Learn a classifier from each variant Vote the learned classifiers to predict on a test example Bagging bootstrap aggregation Breaking it down input dataset D and YFCL output a classifier hD BAG Note that you can use any learner you like use bootstrap to construct variants D1 DT for t 1 T train YFCL on Dt to get ht You can also test ht on the out of bag examples to classify x with hD BAG classify x with h1 hT and predict the most frequently predicted class for x majority vote Bagging bootstrap aggregation Experimentally even with pruning decision trees have low bias but high variance bagging usually improves performance for decision trees and similar methods it doesn t help for certain other methods e g nearest neighbor which have low variance It reduces variance without increasing the bias Results Combining Versus Choosing Between Classifiers Scenario 1 You re the chief scientist at the hot new startup nothingbutfrogs com Kleiner Perkins just gave your new company 20M in funding but you need to stay ahead of the competition you hire Kevin s sister brother and six of his first cousins and put them to work Each tries a different learner and you pick the best one or can you do better Scenario 2 You re building a new classifier learner for class The professor just gave the class tens of thousands of dollars of grant money for time on Amazon s EC2 cluster does that help you COMBINING MULTIPLE LEARNERS OR MULTIPLE PEOPLE S WORK ON BUILDING TUNING LEARNERS Simplest approach A bucket of models Input your top T favorite learners or tunings L1 LT A dataset D Learning algorithm Use 10 CV to estimate the error of L1 LT Pick the best lowest 10 CV error learner L Train L on D and return its hypothesis h Pros and cons of a bucket of models Pros Simple Will give results not much worse than the best of the base learners Cons What if there s not a single best learner Other approaches Vote the hypotheses how would you weight them Combine them some other way How about learning to combine the hypotheses Stacked learners first attempt Problem if Li overfits the data D Input then hi x could be almost your top T favorite learners or tunings always the same as y in D L1 LT A dataset D containing x y But that won t be the case on an Learning algorithm test example Train L1 LT on D to getout of sample the h1 hT Create a new dataset D x containing x y x is a vector of the T predictions h1 x hT x y is the label y for x The fix make an x in D look likecombines the out of sample test Train YFCL on D to get h more which the predictions cases To predict on a new x Construct x as before and predict h x Stacked learners the right way Input D x or is tunings the CV training for x your top T favorite learners L1 Lset T Li D x A dataset D containing x y is the label predicted by the hypothesis learned from D by Li Learning algorithm Train L1 LT on D to get h1 hT Also run 10 CV and collect the CV test predictions x Li D x x for each example x and each learner Li Create a new dataset D containing x y x is the CV test predictions L1 D x x L2 D x x y is the label y for x Train YFCL on D to get h which combines the predictions To predict on a new x we don t do any CV at prediction time Construct x using h1 hT as before and predict h x Pros and cons of stacking Pros Fairly simple Slow but easy to parallelize Cons What if there s not a single …
View Full Document
Unlocking...