Pitt CS 2750 - Ensemble methods

CS 2750 Machine Learning
Lecture 23
Milos Hauskrecht, [email protected], Sennott Square

Ensemble methods. Bagging and Boosting

Administrative announcements
• Term projects:
  – Reports due on Wednesday, April 21, 2004 at 12:30pm.
  – Presentations on Wednesday, April 21, 2004 at 12:30pm.
• Quiz
  – Wednesday, April 14, 2004
  – Closed book
  – Short (~30 minutes)
  – Main ideas of methods covered after the midterm:
    EM, dimensionality reduction, clustering, decision trees, mixtures of experts, bagging and boosting, reinforcement learning.

Ensemble methods
• Mixture of experts
  – Multiple 'base' models (classifiers, regressors), each covering a different part (region) of the input space
• Committee machines:
  – Multiple 'base' models (classifiers, regressors), each covering the complete input space
  – Each base model is trained on a slightly different training set
  – The predictions of all models are combined to produce the output
• Goal: improve the accuracy of the 'base' model
  – Methods:
    • Bagging
    • Boosting
    • Stacking (not covered)

Bagging (Bootstrap Aggregating)
• Given:
  – A training set of N examples
  – A class of learning models (e.g. decision trees, neural networks, …)
• Method:
  – Train multiple (k) models on different samples (data splits) and average their predictions
  – Predict (test) by averaging the results of the k models
• Goal:
  – Improve the accuracy of one model by using multiple copies of it
  – Averaging the misclassification errors on different data splits also gives a better estimate of the predictive ability of the learning method

Bagging algorithm
• Training
  – In each iteration t, t = 1, …, T:
    • Randomly sample, with replacement, N samples from the training set
    • Train a chosen "base model" (e.g. neural network, decision tree) on the sample
• Test
  – For each test example:
    • Run all T trained base models
    • Predict by combining the results of all T models:
      – Regression: averaging
      – Classification: a majority vote

Simple Majority Voting
[Figure: test examples are classified by hypotheses H1, H2, H3; their class "yes"/"no" votes are combined by majority to give the final class.]
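A minimal sketch of the bagging procedure above, assuming scikit-learn decision trees as the "base model", NumPy arrays X and y, and small non-negative integer class labels; the function names and the default T are illustrative, not part of the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T base models, each on a bootstrap sample (N draws with replacement)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)   # sample N indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X_test):
    """Classification: combine the T predictions by a simple majority vote."""
    votes = np.stack([m.predict(X_test) for m in models])   # shape (T, n_test)
    # For each test example, pick the most frequent class label.
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])
```

For regression, the test step would simply average the T model outputs instead of taking a majority vote, as in the algorithm above.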
Analysis of Bagging
• Expected error = Bias + Variance
  – Expected error: the expected discrepancy between the estimated and the true function,
      E[ (f̂(X) − f(X))² ]
  – Bias: the squared discrepancy between the averaged estimate and the true function,
      ( E[f̂(X)] − f(X) )²
  – Variance: the expected divergence of the estimated function from its average value,
      E[ (f̂(X) − E[f̂(X)])² ]

When Bagging works? Under-fitting and over-fitting
• Under-fitting:
  – High bias (the models are not accurate)
  – Small variance (the individual training examples have little influence)
• Over-fitting:
  – Small bias (the models are flexible enough to fit the training data well)
  – Large variance (the models depend very much on the training set)

Averaging decreases variance
• Example
  – Assume we measure a random variable x with a N(µ, σ²) distribution
  – If only one measurement x1 is made:
    • the expected value of the measurement is µ
    • the variance is Var(x1) = σ²
  – If x is measured K times (x1, x2, …, xK) and the value is estimated as (x1 + x2 + … + xK)/K:
    • the mean of the estimate is still µ
    • but the variance is smaller:
      [Var(x1) + … + Var(xK)] / K² = Kσ² / K² = σ²/K
• Observe: bagging is a kind of averaging!

When Bagging works
• Main property of Bagging (proof omitted):
  – Bagging decreases the variance of the base model without changing the bias!
  – Why? Averaging!
• Bagging typically helps
  – when applied to an over-fitted base model (one with a high dependency on the actual training data)
• It does not help much
  – in the high-bias case, when the base model is robust to changes in the training data (introduced by the sampling)

Boosting. Theoretical foundations.
• PAC: Probably Approximately Correct framework
  – an (ε, δ) solution
• PAC learning:
  – learning with a pre-specified accuracy ε and confidence δ:
    the probability that the misclassification error ME(c) is larger than ε is smaller than δ,
      P( ME(c) > ε ) ≤ δ
• Accuracy (ε): the percentage of correctly classified samples on the test data
• Confidence (δ): the probability that, in one experiment, this accuracy is achieved

PAC Learnability
Strong (PAC) learnability:
• There exists a learning algorithm that efficiently learns the classification with a pre-specified accuracy and confidence.
Strong (PAC) learner:
• A learning algorithm P that, given an arbitrary
  – classification error ε (< 1/2), and
  – confidence δ (< 1/2),
• outputs a classifier
  – with classification accuracy > (1 − ε)
  – with confidence probability > (1 − δ)
  – and runs in time polynomial in 1/δ, 1/ε
• Implies: the number of samples N is polynomial in 1/δ, 1/ε

Weak Learner
Weak learner:
• A learning algorithm (learner) W
  – providing classification accuracy > 1 − ε0
  – with probability > 1 − δ0
• for some fixed and uncontrollable
  – classification error ε0 (< 1/2)
  – confidence δ0 (< 1/2)
• and this on an arbitrary distribution of the data

Weak learnability = Strong (PAC) learnability
• Assume there exists a weak learner:
  – it is better than a random guess (50%), with confidence higher than 50%, on any data distribution
• Question:
  – Is the problem then also PAC-learnable?
  – Can we build an algorithm P that achieves an arbitrary (ε, δ) accuracy?
• Why is this important?
  – The usual classification methods (decision trees, neural nets) have specified, but uncontrollable, performance.
  – Can we improve their performance so as to achieve a pre-specified accuracy (confidence)?

Weak = Strong learnability!!!
• Proof due to R. Schapire.
• An arbitrary (ε, δ) improvement is possible.
• Idea: combine multiple weak learners together.
  – Take a weak learner W with confidence δ0 and maximal error ε0
  – It is possible:
    • to improve (boost) the confidence, and
    • to improve (boost) the accuracy
    by training different weak learners on slightly different datasets.

Boosting accuracy
[Figure: learners H1, H2, H3 are trained on samples from the distribution; the sample for H3 consists of examples that H1 and H2 classify differently; correct and wrong classifications are marked.]

Boosting accuracy
• Training
  – Sample randomly from the distribution of examples
  – Train hypothesis H1 on the sample
  – Evaluate the accuracy of H1 on the distribution
  – Sample randomly such that for …
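The two "Boosting accuracy" slides above refer to Schapire's three-learner construction. The sketch below is a hedged illustration of that idea: draw_example (a sampler for the data distribution) and weak_learn (the weak learner) are hypothetical stand-ins, binary labels are assumed, and the filtering details, which the slides do not spell out, follow the standard construction (H2 is trained on a sample that H1 gets right only about half the time, H3 on examples where H1 and H2 disagree):

```python
import random

def boost_three(draw_example, weak_learn, n=1000):
    """Combine three weak hypotheses into a more accurate one by majority vote.

    draw_example() -> (x, y) drawn from the data distribution;
    weak_learn(examples) -> classifier h with h(x) -> label.
    Both callables and the sample size n are illustrative assumptions.
    """
    # 1) Train H1 on an ordinary random sample from the distribution.
    h1 = weak_learn([draw_example() for _ in range(n)])

    # 2) Train H2 on a filtered sample on which H1 is correct only about half the
    #    time: flip a fair coin, then keep drawing until an example of the wanted
    #    kind (correctly vs. incorrectly classified by H1) turns up.
    sample2 = []
    for _ in range(n):
        want_correct = random.random() < 0.5
        while True:                     # may take long if H1 is nearly perfect
            x, y = draw_example()
            if (h1(x) == y) == want_correct:
                sample2.append((x, y))
                break
    h2 = weak_learn(sample2)

    # 3) Train H3 only on examples that H1 and H2 classify differently.
    sample3 = []
    while len(sample3) < n:             # may take long if H1 and H2 mostly agree
        x, y = draw_example()
        if h1(x) != h2(x):
            sample3.append((x, y))
    h3 = weak_learn(sample3)

    # Final hypothesis: a simple majority vote of H1, H2, H3 (binary labels assumed).
    def boosted(x):
        votes = [h1(x), h2(x), h3(x)]
        return max(set(votes), key=votes.count)

    return boosted
```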

