# ILLINOIS CS 446 - 090517.1 (5 pages)

- Pages: 5
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning


CS 446: Machine Learning, Fall 2017

**Lecture 3: Overfitting, Bayes Optimal, and Naive Bayes**

Lecturer: Sanmi Koyejo. Scribe: Shirdon Gorse. Sep 5th, 2017.

**Today's Objectives**

- Generalization
- Bayes Optimal Classifier
- Naive Bayes

## Generalization

### Generalization Error

Generalization is the ability of a learned model or classifier to fit unseen data instances. Generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.

Suppose we have a classifier $h_n$ that maps an instance $x$ to a label $y$. This classifier $h_n$ depends on the dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$, where each $(x_i, y_i)$ comes from a generating distribution $P$. The generalization error is

$$G_n = R(h_n, P) - R(h_n, D_n),$$

i.e., the risk of the classifier $h_n$ over the distribution $P$ minus the risk of $h_n$ over the dataset $D_n$. As $G_n \to 0$, we say that the classifier $h_n$ generalizes.

In practice we never observe the full distribution $P$, so there are two tricks to approximate $R(h_n, P)$:

1. **Train/test split**, where we evaluate the out-of-sample performance:
   $$\hat{G}_n = R(h_n, D_{TEST}) - R(h_n, D_{TRAIN})$$
2. **Cross-validation.** The goal is to set aside data for testing the model during the training phase, in order to limit problems like overfitting and to give insight into how the model will generalize to an independent dataset:
   $$\hat{G}_n = E_D\left[ R(h_n, D_{Test}) - R(h_n, D_{Train}) \right],$$
   where $E_D$ is the average over the datasets (folds). The reason we average is to reduce the variance of the generalization-error estimate: $D_n$ is random, $h_n$ is random since it depends on $D_n$, and $G_n$ is therefore a random variable.

Since cross-validation produces a new classifier $h_n$ for every fold, which one do you pick?

- Majority vote among the fold classifiers
- Re-train on the entire dataset
- Just pick one

### Is Generalization Enough?

No, but why is it not good enough?

**Example.** Suppose the random variable $X$ takes values in $[-B, B]$, the label is $y = \mathrm{sign}(X)$, and $X$ is uniformly distributed across $[-B, B]$. Consider the constant classifier $h_n \equiv 1$.
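The example above can be simulated before working through the algebra. The sketch below is my own illustration, not from the notes; `B = 1` and the sample size `N = 1000` are arbitrary choices:

```python
import random

random.seed(0)
B = 1.0   # X is uniform on [-B, B]
N = 1000  # training-set size (arbitrary)

# Draw a training set: x ~ Uniform(-B, B), y = sign(x).
xs = [random.uniform(-B, B) for _ in range(N)]
ys = [1 if x >= 0 else -1 for x in xs]

def h(x):
    """The constant classifier h_n(x) = +1."""
    return 1

# Training accuracy = proportion of +1 labels, which is about 1/2.
train_acc = sum(1 for x, y in zip(xs, ys) if h(x) == y) / N

# Population accuracy of the constant classifier is P(y = +1) = 1/2 exactly.
pop_acc = 0.5

gap = abs(pop_acc - train_acc)  # the generalization gap is small...
best = 1.0                      # ...but h(x) = sign(x) achieves accuracy 1
print(train_acc, gap, best)
```

The gap between training and population accuracy is small, so $h_n \equiv 1$ generalizes, yet $\mathrm{sign}(X)$ achieves accuracy $1$: good generalization alone does not make a good classifier.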
For the constant classifier $h_n \equiv 1$, the training accuracy is

$$\mathrm{TrainingAccuracy} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}[y_n = h(x_n)] = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}[y_n = h(x_n) = 1] = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}[y_n = 1].$$

With that logic, the training accuracy equals the proportion of $+1$'s in our data, while the generalization accuracy is $\mathrm{Acc}(h_n, P) = P(y = 1) = 1/2$. Our classifier $h_n \equiv 1$ has good generalization error, since $\mathrm{Acc}(h_n, P) - \mathrm{Acc}(h_n, D_n)$ gets small. But we know the best possible accuracy is $1$, achieved by setting the classifier to $h(X) = \mathrm{sign}(X)$.

## Goal of Supervised ML

The goal of supervised machine learning is to find a learning algorithm that makes accurate predictions on new data, i.e., one that is able to generalize. We want the risk of the classifier $h_n$ on the population $P$, $R(h_n, P)$, to be small. To achieve this, the following should hold:

- Accurate fit to the training data, so that the risk of $h_n$ on the dataset $D_n$, $R(h_n, D_n)$, is small.
- The generalization error of $h_n$ should approach $0$: $R(h_n, P) - R(h_n, D_n) \to 0$.

### 1-Nearest Neighbor Classifier

The training accuracy of the 1-nearest-neighbor classifier is exactly $1$ (not somewhere between $0$ and $1$), since the nearest neighbor to any training input is that input itself: the training set is the very set used by the classifier, so 1-NN cannot choose an incorrect label on a training point.

**Example.** Suppose the random variable $X$ takes values in $[-B, B]$ and $P(y = 1 \mid X) = \frac{X + B}{2B}$. By the logic above, the accuracy of the 1-NN classifier $h_n$ on the dataset $D_n$ equals $1$. The generalization of 1-NN, however, is usually not very good. One way to make the nearest-neighbor classifier generalize better is to increase the number of neighbors $k$ that are checked when predicting the label of a new input. The 1-NN classifier also changes a lot across training datasets, i.e., it has high variance.

### Which classifier maximizes $\mathrm{Acc}(h, P)$ for the example above?

Let $\mathcal{F}$ be the set of all possible functions.
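Before answering analytically, the 1-NN behavior described above can be checked empirically on the noisy example $P(y = 1 \mid x) = \frac{x + B}{2B}$. The sketch below is my own illustration (the pure-Python 1-NN, `B = 1`, and the sample sizes are arbitrary choices); it shows that 1-NN has training accuracy exactly 1 but noticeably lower out-of-sample accuracy:

```python
import random

random.seed(1)
B = 1.0
N = 500

def draw(n):
    """Sample (x, y) pairs with x ~ Uniform(-B, B) and P(y=1|x) = (x+B)/(2B)."""
    xs = [random.uniform(-B, B) for _ in range(n)]
    ys = [1 if random.random() < (x + B) / (2 * B) else -1 for x in xs]
    return xs, ys

def one_nn(train_x, train_y, x):
    """Predict the label of the single closest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def accuracy(train_x, train_y, xs, ys):
    return sum(1 for x, y in zip(xs, ys)
               if one_nn(train_x, train_y, x) == y) / len(xs)

train_x, train_y = draw(N)
test_x, test_y = draw(N)

# On the training set, every point's nearest neighbor is itself.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# On fresh data, the noisy labels make 1-NN err often.
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(train_acc, test_acc)
```

The perfect training accuracy alongside a visibly worse test accuracy is exactly the overfitting gap discussed above.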
We want to find the classifier $f^* \in \mathcal{F}$ that maximizes the accuracy on the given population distribution, $\mathrm{Acc}(f, P)$. This means we want to maximize the average of the indicator of $y = f(x)$:

$$\max_f E_P\left[\mathbb{1}[y = f(x)]\right] = \max_f E_x\, E_{y \mid x}\left[\mathbb{1}[y = f(x)]\right].$$

For a fixed $x$, since $y \in \{-1, +1\}$,

$$E_{y \mid x}\left[\mathbb{1}[y = f(x)]\right] = \mathbb{1}[f(x) = 1]\, P(y = 1 \mid x) + \mathbb{1}[f(x) = -1]\, P(y = -1 \mid x).$$

This is maximized by choosing $f(x) = 1$ if $P(y = 1 \mid x) \geq P(y = -1 \mid x)$, and $f(x) = -1$ otherwise. Using $P(y = -1 \mid x) = 1 - P(y = 1 \mid x)$:

$$P(y = 1 \mid x) \geq 1 - P(y = 1 \mid x) \iff 2\,P(y = 1 \mid x) - 1 \geq 0 \iff P(y = 1 \mid x) \geq \frac{1}{2}.$$

Therefore

$$f^*(x) = \mathrm{sign}\left(P(y = 1 \mid x) - \frac{1}{2}\right).$$

Figure 1: A graphical model of the problem, where the optimal classifier $f^*$ predicts $+1$ if $x \geq 0$ and $-1$ if $x < 0$. [figure not reproduced in this preview]

## Bayes Optimal Classifier

### Bayes Theorem

The theorem provides a method to calculate the probability of a hypothesis using prior knowledge of its prior probability $P(h)$, the probability of observing various data when the hypothesis holds $P(D \mid h)$, and the probability of the observed data $P(D)$:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}.$$

### Bayes Optimal Hypothesis

In most cases there are several hypothesis candidates in the set $\mathcal{H}$. We want to find the most probable hypothesis, i.e., the function

$$f^* = \arg\max_f \mathrm{Acc}(f, P).$$

$\mathrm{Acc}(f^*, P)$ is called the Bayes optimal accuracy; correspondingly, $R(f^*, P)$ is the Bayes optimal risk, with $f^* = \arg\min_f R(f, P)$ the Bayes optimal classifier. Note that Bayes optimality depends on the choice of risk $R$. For example, suppose the risk is based on recall:

$$\mathrm{Recall}(h_n, D_n) = \frac{\sum_{n=1}^{N} \mathbb{1}[y_n = h(x_n) = 1]}{\sum_{n=1}^{N} \mathbb{1}[y_n = 1]}.$$

Figure 2: A graphical model explaining the difference between $f^*$ and $h_n$. [figure not reproduced in this preview]

Figure 2 shows the difference between $f^*$ and $h_n$. The difference between the most optimal function $f^*$ in the set of all functions $\mathcal{F}$ and the most optimal function $h^*$ in the hypothesis set $\mathcal{H}$ is called the representative error:

$$\mathrm{RepresentativeError} = R(h^*, P) - R(f^*, P).$$

The difference between the error of $h$ and …
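To make the derivation above concrete, the sketch below (my own illustration, not from the notes) implements $f^*(x) = \mathrm{sign}(P(y = 1 \mid x) - \tfrac{1}{2})$ for the running example $P(y = 1 \mid x) = \frac{x + B}{2B}$, where $f^*$ reduces to $\mathrm{sign}(x)$. With $p = P(y = 1 \mid x)$ uniform on $[0, 1]$, the Bayes optimal accuracy is $E[\max(p, 1 - p)] = 3/4$:

```python
import random

random.seed(2)
B = 1.0
N = 20000  # Monte Carlo sample size (arbitrary)

def p_pos(x):
    """P(y = 1 | x) for the running example."""
    return (x + B) / (2 * B)

def bayes_opt(x):
    """f*(x) = sign(P(y=1|x) - 1/2); here this reduces to sign(x)."""
    return 1 if p_pos(x) >= 0.5 else -1

# Monte Carlo estimate of Acc(f*, P).
correct = 0
for _ in range(N):
    x = random.uniform(-B, B)
    y = 1 if random.random() < p_pos(x) else -1
    correct += (bayes_opt(x) == y)
acc = correct / N
print(acc)  # close to the Bayes optimal accuracy 3/4
```

No classifier, however complex, can beat this $3/4$ on this distribution: the remaining errors come from the label noise itself, not from a poor choice of $f$.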
