Variance Reduction and Ensemble Methods
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag

Last Time
- PAC learning
- The bias-variance tradeoff:
  - small hypothesis spaces (not enough flexibility) can have high bias
  - rich hypothesis spaces (too much flexibility) can have high variance
- Today: more on this phenomenon and how to get around it

Intuition
- Bias measures the accuracy or quality of the algorithm: high bias means a poor match
- Variance measures the precision or specificity of the match: high variance means a weak match
- We would like to minimize both. Unfortunately, we can't do this independently; there is a trade-off.

Bias-Variance Analysis in Regression
- The true function is y = f(x) + ε, where the noise ε is normally distributed with zero mean and standard deviation σ
- Given a set of training examples (x_1, y_1), ..., (x_n, y_n), we fit a hypothesis g(x) = w^T x + b to the data to minimize the squared error Σ_i [y_i − g(x_i)]^2

2-D Example
[Figure: 20 points sampled from f(x) = 2 sin(1.5 x) with noise N(0, 0.2)]

2-D Example
[Figure: 50 fits, 20 examples each]

Bias-Variance Analysis
- Given a new data point x' with observed value y' = f(x') + ε, we want to understand the expected prediction error
- Suppose that the training samples are drawn independently from a distribution p(S); we want to compute the expected error of the estimator, E[(g(x') − y')^2]

Probability Reminder
- Variance of a random variable Z:
  Var[Z] = E[(Z − E[Z])^2] = E[Z^2] − (E[Z])^2
- Property used below:
  E[Z^2] = Var[Z] + (E[Z])^2

Bias-Variance-Noise Decomposition
E[(g(x') − y')^2]
  = E[g(x')^2] − 2 E[g(x')] E[y'] + E[y'^2]                      (the training samples and the noise are independent)
  = Var[g(x')] + (E[g(x')])^2 − 2 E[g(x')] f(x') + Var[y'] + f(x')^2   (follows from the definition of variance)
  = Var[g(x')] + (E[g(x')] − f(x'))^2 + E[(y' − f(x'))^2]
  = Variance + Bias^2 + Noise

Bias, Variance, and Noise
- Variance, E[(g(x') − E[g(x')])^2]: describes how much g(x') varies from one training set to another
- Bias, E[g(x')] − f(x'): describes the average error of g(x')
- Noise, E[(y' − f(x'))^2] = E[ε^2] = σ^2: describes how much y' varies from f(x')

2-D Example
[Figures: the 50 fits (20 examples each) and the corresponding bias, variance, and noise]

Bias
- Low bias:
  - linear regression applied to linear data
  - 2nd-degree polynomial applied to quadratic data
- High bias:
  - constant function applied to non-constant data
  - linear regression applied to highly non-linear data

Variance
- Low variance:
  - constant function
  - model independent of the training data
- High variance:
  - high-degree polynomial

Bias-Variance Tradeoff
- bias^2 + variance is what counts for prediction
- As we saw in PAC learning, we often have
  - low bias ⇒ high variance
  - low variance ⇒ high bias
- How can we deal with this in practice?
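To make the decomposition concrete, here is a minimal simulation sketch in Python (assuming NumPy). It reuses the 2-D example above, 50 linear fits to 20 noisy samples of f(x) = 2 sin(1.5x) each, and estimates bias^2, variance, and noise at a single query point x'. The sampling interval [0, 5] and the query point are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2 * np.sin(1.5 * x)      # true function from the 2-D example

sigma = 0.2                          # noise standard deviation, as in the example
x_test = 2.0                         # fixed query point x' (illustrative)

preds = []
for _ in range(50):                  # 50 fits, 20 examples each
    x = rng.uniform(0, 5, size=20)   # assumed input range
    y = f(x) + rng.normal(0, sigma, size=20)
    w, b = np.polyfit(x, y, deg=1)   # least-squares linear fit g(x) = w*x + b
    preds.append(w * x_test + b)

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2   # (E[g(x')] - f(x'))^2
variance = preds.var()                    # E[(g(x') - E[g(x')])^2]
noise = sigma ** 2                        # E[(y' - f(x'))^2]

print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {noise:.4f}")
print(f"expected squared error approx. {bias2 + variance + noise:.4f}")
```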
Reduce Variance Without Increasing Bias
- Averaging reduces variance: if Z_1, ..., Z_N are i.i.d. random variables, then
  Var[(1/N) Σ_i Z_i] = Var[Z] / N
- Idea: average models to reduce model variance
- The problem: there is only one training set. Where do multiple models come from?

Bagging: Bootstrap Aggregation
- Take repeated bootstrap samples from the training set D (Breiman, 1994)
- Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D
- Bagging:
  - create k bootstrap samples D_1, ..., D_k
  - train a distinct classifier on each D_i
  - classify a new instance by majority vote / average

[Figure: bootstrap aggregation, image from the slides of David Sontag]

Bagging
[Table: ten data points and the indices drawn, with replacement, into three bootstrap samples BS 1, BS 2, BS 3]
- Build a classifier from each bootstrap sample
- In each bootstrap sample, each data point has probability (1 − 1/N)^N of not being selected
- The expected number of distinct data points in each sample is then
  N (1 − (1 − 1/N)^N) ≈ N (1 − exp(−1)) ≈ 0.632 N
- If we have 1 TB of data, each bootstrap sample will contain roughly 632 GB of distinct data; this can present computational challenges

Decision Tree Bagging
[Figures: a single decision tree vs. 100 bagged trees, images from the slides of David Sontag]

Bagging Results
[Figure: misclassification rates without bagging vs. with bagging]
Breiman, "Bagging Predictors," Berkeley Statistics Department TR 421, 1994

Random Forests
- An ensemble method specifically designed for decision tree classifiers
- Introduces two sources of randomness: bagging and random input vectors
  - Bagging method: each tree is grown using a bootstrap sample of the training data
  - Random vector method: the best split at each node is chosen from a random sample of m attributes instead of all attributes

Random Forest Algorithm
For b = 1 to B:
  - Draw a bootstrap sample of size N from the data
  - Grow a tree T_b using the bootstrap sample as follows:
    - choose m attributes uniformly at random from the data
    - choose the best attribute among the m to split on
    - split on the best attribute and recurse until partitions have fewer than the minimum number of nodes
Prediction for a new data point x:
  - Regression: (1/B) Σ_b T_b(x)
  - Classification: choose the majority class label among T_1(x), ..., T_B(x)

Random Forest Demo
- A demo of random forests implemented in JavaScript

When Will Bagging Improve Accuracy?
- It depends on the stability of the base-level classifiers
- A learner is unstable if a small change to the training set causes a large change in the output hypothesis
- If small changes in D cause large changes in the output, then bagging will likely improve performance
- Bagging can help unstable procedures but could hurt the performance of stable procedures
  - Decision trees are unstable; nearest neighbor is stable
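As a minimal sketch of these methods in practice (assuming scikit-learn is available; the synthetic dataset and all hyperparameter values below are illustrative, not from the slides), the following compares a single decision tree, a bagged ensemble of trees, and a random forest on held-out data. Because a single tree is unstable, the averaged ensembles typically show the larger improvement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (not from the slides)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: 100 trees, each grown on a bootstrap sample, combined by majority vote
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    # Random forest: bagging plus a random subset of attributes considered at each split
    "random forest": RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0),
}

for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:14s} test accuracy = {acc:.3f}")
```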