Classification
(Mitchell's "Machine Learning", Chapter 6)
Topics: What is learning, revisited; Overfitting; Bayes optimal classifier; Naïve Bayes
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, January 25th, 2006

Announcements
- Recitations stay on Thursdays, 5-6:30pm in Wean 5409.
- Special Matlab recitation: Jan 25 (Wed), 5:00-7:00pm in NSH 3305.
- The first homework is out (a programming part and an analytic part). Remember the collaboration policy: you can discuss questions, but you need to write your own solutions and code. Due Mon, Feb 6th, at the beginning of class. Start early!

Bias-Variance Tradeoff
- The choice of hypothesis class introduces a learning bias.
- A more complex class means less bias, but more variance.

Training set error
- Given a dataset (the training data), choose a loss function, e.g., squared (L2) error for regression.
- Training set error: for a particular set of parameters w, the loss function evaluated on the training data:
    error_train(w) = (1/N_train) Σ_j ( t(x_j) - Σ_i w_i h_i(x_j) )^2
- [Plot: training set error as a function of model complexity.]

Prediction error
- The training set error can be a poor measure of the quality of the solution.
- Prediction error: we really care about the error over all possible input points, not just the training data:
    error_true(w) = E_x[ ( t(x) - Σ_i w_i h_i(x) )^2 ] = ∫ ( t(x) - Σ_i w_i h_i(x) )^2 p(x) dx
- [Plot: prediction error as a function of model complexity.]

Computing prediction error
- Computing the prediction error requires a hard integral, and we may not know t(x) for every x.
- Monte Carlo integration (sampling approximation): sample a set of i.i.d. points x_1, ..., x_M from p(x) and approximate the integral with the sample average:
    error_true(w) ≈ (1/M) Σ_j ( t(x_j) - Σ_i w_i h_i(x_j) )^2

Why training set error doesn't approximate prediction error
- The sampling approximation of the prediction error and the training error are very similar equations. So why is the training set error a bad measure of the prediction error?
- Because you cheated! The training error is a good estimate for a single, fixed w. But you optimized w with respect to the training error, and found the w that is good for this particular set of samples.
- The training error is therefore an optimistically biased estimate of the prediction error.

Test set error
- Given a dataset, randomly split it into two parts: training data {x_1, ..., x_Ntrain} and test data {x_1, ..., x_Ntest}.
- Use the training data to optimize the parameters w.
- Test set error: for the final solution w*, evaluate the error on the test data:
    error_test(w*) = (1/N_test) Σ_j ( t(x_j) - Σ_i w*_i h_i(x_j) )^2
- [Plot: test set error as a function of model complexity.]

How many points do I use for training vs. testing?
- A very hard question to answer: with too few training points the learned w is bad; with too few test points you never know whether you reached a good solution.
- Bounds such as Hoeffding's inequality can help (more on this later this semester), but the question is still hard to answer.
- Typically: if you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error and use the rest for learning; if you have little data, you need to pull out the "big guns", e.g., bootstrapping.

Error estimators
- [Plot: error as a function of the number of training examples, for a fixed model complexity, from little data to infinite data.]
- Be careful: the test set estimate is only unbiased if you never, never, never, never do any learning on the test data.
- For example, if you use the test set to select the degree of the polynomial, the estimate is no longer unbiased. (We will address this problem later in the semester.)

Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that
    error_train(w) < error_train(w')   and   error_true(w') < error_true(w).
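To make the train/test methodology above concrete, here is a minimal Python sketch (assuming numpy and a 1-D polynomial regression problem; the synthetic data, the degree sweep, and all names are illustrative, not taken from the lecture). The training error typically keeps shrinking as the degree grows, while the test error eventually rises again, which is exactly the overfitting picture just defined.

    import numpy as np

    # Minimal sketch of the train/test methodology described above.
    # Assumptions: a 1-D regression problem with inputs x, targets t, and a
    # polynomial hypothesis space; all names and data here are illustrative.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=60)
    t = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)  # unknown target + noise

    # Randomly split the dataset into training and test data.
    perm = rng.permutation(len(x))
    train_idx, test_idx = perm[:40], perm[40:]

    def squared_error(w, xs, ts):
        # Average squared (L2) loss of the polynomial with coefficients w.
        return np.mean((np.polyval(w, xs) - ts) ** 2)

    for degree in range(1, 10):
        # Fit w on the training data only (least squares), then evaluate both errors.
        w = np.polyfit(x[train_idx], t[train_idx], degree)
        err_train = squared_error(w, x[train_idx], t[train_idx])
        err_test = squared_error(w, x[test_idx], t[test_idx])
        print(f"degree {degree}: train error {err_train:.3f}, test error {err_test:.3f}")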
What is supervised learning, more formally?
- Given:
  - Dataset: instances {(x_1, t(x_1)), ..., (x_N, t(x_N))}; e.g., (x_i, t(x_i)) = ((GPA = 3.9, IQ = 120, MLscore = 99), 150K).
  - Hypothesis space H: e.g., polynomials of degree 8.
  - Loss function: measures the quality of a hypothesis h in H, e.g., squared error for regression.
- Obtain:
  - Learning algorithm: obtain the h in H that minimizes the loss function, e.g., using matrix operations for regression.
  - We want to minimize the prediction error, but we can only minimize the error on the dataset.

Types of supervised learning problems, revisited
- Regression: e.g., a dataset of (position, temperature) pairs; hypothesis space; loss function.
- Density estimation: e.g., a dataset of grades; hypothesis space; loss function.
- Classification: e.g., a dataset of (brain image, verb vs. noun) pairs; hypothesis space; loss function.

Learning is (simply) function approximation
- The general supervised learning problem: given some data (including features), a hypothesis space, and a loss function.
- Learning is no magic: we are simply trying to find a function that fits the data.
- Regression, density estimation, classification: not surprisingly, these seemingly different problems have very similar solutions.

Classification
- Learn h: X → Y, where X are the features and Y is the target (the classes).
- Suppose you knew P(Y | X) exactly; how should you classify? With the Bayes classifier:
    h_Bayes(x) = argmax_y P(Y = y | X = x)
- Why?

Optimal classification
- Theorem: the Bayes classifier h_Bayes is optimal. That is,
    error_true(h_Bayes) ≤ error_true(h) for every h.
- Proof: ...

Bayes rule
-   P(Y | X) = P(X | Y) P(Y) / P(X),
  which is shorthand for
    P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x), for every x and y.

How hard is it to learn the optimal classifier?
- Given the data, how do we represent these distributions, and how many parameters do they need?
  - Prior P(Y): suppose Y is composed of k classes (k - 1 parameters).
  - Likelihood P(X | Y): suppose X is composed of n binary features (2^n - 1 parameters for each class).
- Such a complex model means high variance with limited data.

Conditional independence
- X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
    (for all i, j, k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
- e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning).
- Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z).

What if features are independent?
- Predict the 10-701 grade from two conditionally independent features: HomeworkGrade and ClassAttendance.

The Naïve Bayes assumption
- Naïve Bayes assumption: the features are independent given the class:
    P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y),
  and more generally
    P(X_1, ..., X_n | Y) = Π_i P(X_i | Y).
- How many parameters now? Suppose X is composed of n binary features: only n parameters per class, instead of 2^n - 1.

The Naïve Bayes classifier
- Given: the prior P(Y); n conditionally independent features X_i given the class Y; and, for each X_i, the likelihood P(X_i | Y).
- Decision rule:
    y* = h_NB(x) = argmax_y P(y) Π_i P(x_i | y).
- If the assumption holds, NB is the optimal classifier.

MLE for the parameters of NB
- Given a dataset, let Count(A = a, B = b) be the number of examples where A = a and B = b.
- The MLE for NB is simply:
    Prior:      P(Y = y) = Count(Y = y) / N
    Likelihood: P(X_i = x_i | Y = y) = Count(X_i = x_i, Y = y) / Count(Y = y).

Subtleties of the NB classifier, 1: violating the NB assumption
- Usually, features are not conditionally independent:
    P(X_1, ..., X_n | Y) ≠ Π_i P(X_i | Y).
- The actual probabilities P(Y | X) are then often biased towards 0 or 1.
- Nonetheless, NB is the single most used classifier out there. NB often performs well even when the assumption is violated; [Domingos & Pazzani '96] discuss some conditions for good performance.

Subtleties of the NB classifier, 2: insufficient training data
- What if you never see a training instance where X_1 = a when Y = b? E.g., Y = SpamEmail, X_1 = "Enlargement". Then the MLE gives P(X_1 = a | Y = b) = 0.
- Thus, no matter what values X_2, ..., X_n take, P(Y = b | X_1 = a, X_2, ..., X_n) = 0.
- What now?

MAP for the Beta distribution
- MAP: use the most likely parameter,
    θ* = argmax_θ P(θ | D).
- A Beta prior is equivalent to extra thumbtack flips.
- As N → ∞, the prior is "forgotten"; but for small sample sizes, the prior is important.

Bayesian learning for NB parameters (a.k.a. smoothing)
- Dataset ...
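As a worked instance of the MAP-for-Beta slide above (a standard result, filled in here rather than copied from the slide): for a thumbtack parameter θ with a Beta(β_H, β_T) prior and data D containing N_H heads and N_T tails, the posterior is Beta(β_H + N_H, β_T + N_T), whose mode gives

    θ_MAP = argmax_θ P(θ | D) = (N_H + β_H - 1) / (N_H + N_T + β_H + β_T - 2).

So the prior acts like β_H - 1 extra heads and β_T - 1 extra tails: as N_H + N_T → ∞ the estimate approaches the MLE N_H / (N_H + N_T), while for small samples the prior dominates. Applying the same idea to each Count(X_i, Y) is the smoothing referred to on the last slide.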
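Putting the NB pieces together (the conditional-independence assumption, the MLE counts, and smoothing), here is a minimal Python sketch assuming binary features and two classes. The function names, the add-m (Beta-prior) smoothing choice, and the toy spam data are illustrative assumptions, not code from the course.

    import numpy as np

    # Minimal sketch of a Naive Bayes classifier for binary features and binary
    # labels, with add-m (Beta-prior) smoothing so no likelihood is exactly zero.
    # Variable names and the toy data are illustrative, not from the lecture.

    def train_nb(X, y, m=1.0):
        """X: (N, n) array of 0/1 features; y: (N,) array of 0/1 labels.
        Returns log P(Y=c) and log P(X_i=1 | Y=c), smoothed with m virtual examples."""
        n_classes, n_features = 2, X.shape[1]
        log_prior = np.zeros(n_classes)
        log_lik1 = np.zeros((n_classes, n_features))  # log P(X_i = 1 | Y = c)
        for c in range(n_classes):
            Xc = X[y == c]
            log_prior[c] = np.log((len(Xc) + m) / (len(X) + n_classes * m))
            # Smoothed estimate: (Count(X_i=1, Y=c) + m) / (Count(Y=c) + 2m)
            p1 = (Xc.sum(axis=0) + m) / (len(Xc) + 2 * m)
            log_lik1[c] = np.log(p1)
        return log_prior, log_lik1

    def predict_nb(x, log_prior, log_lik1):
        """Decision rule: argmax_c  log P(c) + sum_i log P(x_i | c)."""
        log_lik0 = np.log1p(-np.exp(log_lik1))  # log P(X_i = 0 | Y = c)
        scores = log_prior + (log_lik1 * x + log_lik0 * (1 - x)).sum(axis=1)
        return int(np.argmax(scores))

    # Toy usage: Y = spam (1) / not spam (0), two binary word-presence features.
    X = np.array([[1, 1], [1, 0], [0, 0], [0, 1], [1, 1], [0, 0]])
    y = np.array([1, 1, 0, 0, 1, 0])
    log_prior, log_lik1 = train_nb(X, y, m=1.0)
    print(predict_nb(np.array([1, 0]), log_prior, log_lik1))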