Contents (slide titles):
- What’s learning, revisited
- Overfitting
- Bayes optimal classifier
- Naïve Bayes
- Announcements
- Bias-Variance Tradeoff
- Training set error
- Training set error as a function of model complexity
- Prediction error
- Prediction error as a function of model complexity
- Computing prediction error
- Why training set error doesn’t approximate prediction error?
- Test set error
- Test set error as a function of model complexity
- How many points do I use for training/testing?
- Error estimators
- Error as a function of number of training examples for a fixed model complexity
- Overfitting
- What’s (supervised) learning, more formally
- Types of (supervised) learning problems, revisited
- Learning is (simply) function approximation!
- Classification
- Optimal classification
- Bayes Rule
- How hard is it to learn the optimal classifier?
- Conditional Independence
- What if features are independent?
- The Naïve Bayes assumption
- The Naïve Bayes Classifier
- MLE for the parameters of NB
- Subtleties of NB classifier 1 – Violating the NB assumption
- Subtleties of NB classifier 2 – Insufficient training data
- MAP for Beta distribution
- Bayesian learning for NB parameters – a.k.a. smoothing
- Text classification
- Features X are entire document – Xi for ith word in article
- NB for Text classification
- Bag of words model
- Bag of Words Approach
- NB with Bag of Words for text classification
- Twenty News Groups results
- Learning curve for Twenty News Groups
- What you need to know
- Readings: Mitchell’s Machine Learning, Chapter 6

What’s learning, revisited
Overfitting
Bayes optimal classifier
Naïve Bayes

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
January 25th, 2006

Announcements
- Recitations stay on Thursdays, 5–6:30pm in Wean 5409
- Special Matlab recitation: Wed., Jan. 25, 5:00–7:00pm in NSH 3305
- First homework is out: a programming part and an analytic part
- Remember the collaboration policy: you can discuss the questions, but you need to write your own solutions and code
- Due Mon.
Feb 6th, at the beginning of class. Start early!

Bias-Variance Tradeoff
- The choice of hypothesis class introduces a learning bias
- More complex class → less bias
- More complex class → more variance

Training set error
- Given a dataset (training data)
- Choose a loss function, e.g., squared error (L2) for regression
- Training set error: for a particular set of parameters w, the loss function evaluated on the training data:
  error_train(w) = (1/N_train) Σ_{j=1..N_train} [t(x_j) − Σ_i w_i h_i(x_j)]²

Training set error as a function of model complexity (figure)

Prediction error
- Training set error can be a poor measure of the “quality” of the solution
- Prediction error: we really care about the error over all possible input points, not just the training data:
  error_true(w) = E_x[(t(x) − Σ_i w_i h_i(x))²] = ∫ (t(x) − Σ_i w_i h_i(x))² p(x) dx

Prediction error as a function of model complexity (figure)

Computing prediction error
- Computing the prediction error requires a hard integral
- We may not know t(x) for every x
- Monte Carlo integration (sampling approximation):
  - Sample a set of i.i.d. points {x_1, …, x_M} from p(x)
  - Approximate the integral with the sample average:
    error_true(w) ≈ (1/M) Σ_{j=1..M} (t(x_j) − Σ_i w_i h_i(x_j))²

Why doesn’t training set error approximate prediction error?
- Sampling approximation of prediction error: (1/M) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))², with the x_j drawn from p(x)
- Training error: (1/N_train) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))²
- Very similar equations!!! Why is the training set a bad measure of prediction error???
- Because you cheated!!! Training error is a good estimate for a single, fixed w — but you optimized w with respect to the training error, and found a w that is good for this particular set of samples
- Training error is an (optimistically) biased estimate of prediction error

Test set error
- Given a dataset, randomly split it into two parts:
  - Training data – {x_1, …, x_Ntrain}
  - Test data – {x_1, …, x_Ntest}
- Use the training data to optimize the parameters w
- Test set error: for the final solution w*, evaluate the error using:
  error_test(w*) = (1/N_test) Σ_{j=1..N_test} (t(x_j) − Σ_i w*_i h_i(x_j))², summed over the test points

Test set error as a function of model complexity (figure)

How many points do I use for training/testing?
- Very hard question to answer!
- Too few training points: the learned w is bad
- Too few test points: you never know whether you reached a good solution
- Bounds, such as Hoeffding’s inequality, can help:
  P(|error_true(w) − error_test(w)| ≥ ε) ≤ 2 exp(−2 N_test ε²)
- More on this later this semester, but it is still hard to answer
- Typically:
  - If you have a reasonable amount of data, pick a test set “large enough” for a “reasonable” estimate of the error, and use the rest for learning
  - If you have little data, then you need to pull out the big guns… e.g., bootstrapping

Error estimators (figure)

Error as a function of the number of training examples, for a fixed model complexity (figure; x-axis runs from little data to infinite data)

Error estimators
- Be careful!!! The test set error is only unbiased if you never, never, never do any learning on the test data
- For example, if you use the test set to select the degree of the polynomial… it is no longer unbiased!!!
- (We will address this problem later in the semester)

Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w’ such that:
  error_train(w) < error_train(w’)  and  error_true(w’) < error_true(w)

What’s (supervised) learning, more formally
- Given:
  - Dataset: instances {〈x_1; t(x_1)〉, …, 〈x_N; t(x_N)〉}
    e.g., 〈x_i; t(x_i)〉 = 〈(GPA=3.9, IQ=120, MLscore=99); 150K〉
  - Hypothesis space: H, e.g., polynomials of degree 8
  - Loss function: measures the quality of a hypothesis h ∈ H, e.g., squared error for regression
- Obtain:
  - Learning algorithm: obtain the h ∈ H that minimizes the loss function, e.g., using matrix operations for regression
  - We want to minimize the prediction error, but can only minimize the error on the dataset

Types of (supervised) learning problems, revisited
- Regression, e.g.:
  - dataset: 〈position; temperature〉
  - hypothesis space:
  - loss function:
- Density estimation, e.g.:
  - dataset: 〈grades〉
  - hypothesis space:
  - loss function:
- Classification, e.g.:
  - dataset: 〈brain image; {verb v. noun}〉
  - hypothesis space:
  - loss function:

Learning is (simply) function approximation!
- The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function
- Learning is no magic!
- Simply trying to find a function that fits the data:
  - Regression
  - Density estimation
  - Classification
- (Not surprisingly) seemingly different problems, very similar solutions…

Classification
- Learn h: X → Y
  - X – features
  - Y – target classes
- Suppose you know P(Y|X) exactly; how should you classify?
- Bayes classifier: h*(x) = arg max_y P(Y = y | X = x)