CS 2750 Machine Learning
Lecture 21
Ensemble methods. Boosting
Milos Hauskrecht, [email protected], Sennott Square

Administrative announcements
• Term projects:
  – Reports due on Wednesday, April 23, 2003 at 2pm.
  – Presentations on Friday, April 25, 2003 at 1pm.
• Quiz
  – Monday, April 21, 2003
  – Closed book
  – Short (~30 minutes)
  – Main ideas of the methods covered after the midterm:
    • EM, dimensionality reduction, clustering, non-parametric methods, decision trees, mixtures of experts, bagging and boosting, reinforcement learning.

Ensemble methods
• Mixture of experts
  – Different 'base' models (classifiers, regressors) cover different parts of the input space
• Committee machines:
  – Train several 'base' models on the complete input space, but on slightly different training sets
  – Combine their decisions to produce the final result
  – Goal: improve the accuracy of the 'base' model
  – Methods:
    • Bagging
    • Boosting
    • Stacking (not covered)

Bagging (Bootstrap Aggregating)
• Given:
  – A training set of N examples
  – A class of learning models (e.g. decision trees, neural networks, ...)
• Method:
  – Train multiple (k) models on different samples (data splits) and average their predictions (see the sketch below)
  – Predict (test) by averaging the results of the k models
• Goal:
  – Improve the accuracy of one model by using multiple copies of it
  – Averaging the misclassification errors on different data splits also gives a better estimate of the predictive ability of a learning method

Simple majority voting
[Figure: test examples classified by hypotheses H1, H2, H3; the final class ("yes" or "no") is decided by majority vote.]

Analysis of Bagging
• Expected error = Bias + Variance
  – Expected error is the expected discrepancy between the estimated and the true function:
      E[(\hat{f}(X) - f(X))^2]
  – Bias is the squared discrepancy between the averaged estimate and the true function:
      (E[\hat{f}(X)] - f(X))^2
  – Variance is the expected divergence of the estimated function from its average value:
      E[(\hat{f}(X) - E[\hat{f}(X)])^2]

When does Bagging work? Under-fitting and over-fitting
• Under-fitting:
  – High bias (models are not accurate)
  – Small variance (smaller influence of the examples in the training set)
• Over-fitting:
  – Small bias (models are flexible enough to fit the training data well)
  – Large variance (models depend very much on the training set)

Averaging decreases variance
• Example
  – Assume we measure a random variable x with a N(µ, σ^2) distribution
  – If only one measurement x1 is made:
    • The expected mean of the measurement is µ
    • The variance is Var(x1) = σ^2
  – If x is measured K times (x1, x2, ..., xK) and the value is estimated as (x1 + x2 + ... + xK)/K:
    • The mean of the estimate is still µ
    • But the variance is smaller:
        [Var(x1) + ... + Var(xK)] / K^2 = Kσ^2 / K^2 = σ^2 / K
• Observe: bagging is a kind of averaging!

When Bagging works
• Main property of Bagging (proof omitted):
  – Bagging decreases the variance of the base model without changing the bias!
  – Why? Averaging!
• Bagging typically helps
  – When applied with an over-fitted base model
    • High dependency on the actual training data
• It does not help much
  – High bias: when the base model is robust to changes in the training data (due to sampling)
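The bagging procedure above fits in a few lines of code. The following is a minimal sketch, assuming scikit-learn decision trees as the base model, a synthetic dataset from make_classification, and k = 25 bootstrap models; these choices are illustrative and not part of the lecture.

    # Minimal bagging sketch (assumptions noted above; not the lecture's own code).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

    k, N = 25, len(X_tr)
    models = []
    for _ in range(k):
        boot = rng.integers(0, N, size=N)              # bootstrap sample: N indices drawn with replacement
        tree = DecisionTreeClassifier(random_state=0)  # fully grown tree: low bias, high variance
        models.append(tree.fit(X_tr[boot], y_tr[boot]))

    # Predict by averaging the k models' 0/1 predictions, i.e. majority vote.
    avg = np.mean([m.predict(X_te) for m in models], axis=0)
    y_bag = (avg >= 0.5).astype(int)

    single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("single-tree test error:", np.mean(single.predict(X_te) != y_te))
    print("bagged test error:     ", np.mean(y_bag != y_te))

The fully grown trees are deliberately high-variance, which is exactly the over-fitted regime in which the slides say bagging helps.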
Boosting. Theoretical foundations.
• PAC: Probably Approximately Correct framework
  – (ε, δ) solution
• PAC learning:
  – Learning with a pre-specified accuracy ε and confidence δ
  – The probability that the misclassification error ME is larger than ε is smaller than δ:
      P(ME(c) > ε) ≤ δ
• Accuracy (ε): percent of correctly classified samples in the test
• Confidence (δ): the probability that in one experiment this accuracy will be achieved

PAC learnability
Strong (PAC) learnability:
• There exists a learning algorithm that efficiently learns the classification with a pre-specified accuracy and confidence
Strong (PAC) learner:
• A learning algorithm P that, given an arbitrary
  – classification error ε (< 1/2), and
  – confidence δ (< 1/2),
• outputs a classifier
  – with classification accuracy > (1 - ε)
  – with confidence probability > (1 - δ)
  – and runs in time polynomial in 1/δ and 1/ε
• Implies: the number of samples N is polynomial in 1/δ and 1/ε

Weak learner
Weak learner:
• A learning algorithm (learner) W
  – providing classification accuracy > 1 - ε_0
  – with probability > 1 - δ_0
• for some fixed and uncontrollable
  – classification error ε_0 (< 1/2)
  – confidence δ_0 (< 1/2)
  on an arbitrary problem

Weak learnability = Strong (PAC) learnability
• Assume there exists a weak learner
  – On any problem it is better than a random guess (50%), with confidence higher than 50%
• Question:
  – Is the problem also PAC-learnable?
  – Can we construct an algorithm P that achieves an arbitrary (ε, δ) accuracy?
• Why is this important?
  – Usual classification methods (decision trees, neural nets) have specified, but uncontrollable, performance.
  – Can we improve the performance to achieve a pre-specified accuracy (confidence)?

Weak = Strong learnability!!!
• Proof due to R. Schapire
• An arbitrary (ε, δ) improvement is possible
• Idea: combine multiple weak learners together
  – Weak learner W with confidence δ_0 and maximal error ε_0
  – It is possible:
    • to improve (boost) the confidence
    • to improve (boost) the accuracy
    by training different weak learners on slightly different datasets

Boosting accuracy
[Figure: samples from the training distribution; the region where H1 and H2 classify differently is marked, along with correct and wrong classifications, and H3.]

Boosting accuracy
• Training (see the sketch below)
  – Sample randomly from the distribution of examples
  – Train hypothesis H1 on the sample
  – Evaluate the accuracy of H1 on the distribution
  – Sample randomly such that H1 classifies half of the samples correctly and the other half incorrectly; train hypothesis H2 on this sample
  – Train H3 on samples from the distribution on which H1 and H2 classify differently
• Test
  – For each example, decide according to the majority vote of H1, H2 and H3

Theorem
• If each hypothesis has an error ε_0, the final classifier has error < g(ε_0) = 3ε_0^2 - 2ε_0^3
• Accuracy improved!
• Apply recursively to reach the target accuracy!

Theoretical Boosting algorithm
• Similarly to
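The following is a minimal sketch of the three-hypothesis construction and the error bound described above. It assumes scikit-learn decision stumps as the weak learner, a large synthetic pool standing in for the example distribution, and a sample size of n = 1000; all of these are illustrative assumptions, and the finite-pool numbers only roughly approximate the theorem, which is stated for the exact filtered distributions.

    # Minimal sketch of the three-hypothesis boosting construction (assumptions noted above).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=20000, n_features=10, random_state=0)  # stands in for the distribution
    idx_all = np.arange(len(y))
    n = 1000

    def weak_learner():
        return DecisionTreeClassifier(max_depth=1)   # decision stump: weak, but better than random

    def sample(idx, size):
        return rng.choice(idx, size=size, replace=False)

    # H1: train on a random sample from the distribution.
    i1 = sample(idx_all, n)
    h1 = weak_learner().fit(X[i1], y[i1])
    p1 = h1.predict(X)

    # H2: train on a set where H1 is correct on half the examples and wrong on the other half.
    correct, wrong = idx_all[p1 == y], idx_all[p1 != y]
    half = min(n // 2, len(correct), len(wrong))
    i2 = np.concatenate([sample(correct, half), sample(wrong, half)])
    h2 = weak_learner().fit(X[i2], y[i2])
    p2 = h2.predict(X)

    # H3: train on examples where H1 and H2 classify differently.
    disagree = idx_all[p1 != p2]
    i3 = sample(disagree, min(n, len(disagree)))
    h3 = weak_learner().fit(X[i3], y[i3])

    # Final classifier: majority vote of H1, H2, H3 (binary 0/1 labels).
    votes = p1 + p2 + h3.predict(X)
    final = (votes >= 2).astype(int)

    e1 = np.mean(p1 != y)          # weak learner's error on the distribution
    e_vote = np.mean(final != y)   # error of the boosted majority vote
    print("H1 error:            ", round(e1, 3))
    print("majority-vote error: ", round(e_vote, 3))
    # Rough comparison with the theorem's bound g(e) = 3e^2 - 2e^3, using e = H1's error;
    # the empirical numbers need not match the bound exactly in this simplified setting.
    print("bound 3e^2 - 2e^3:   ", round(3 * e1**2 - 2 * e1**3, 3))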

