# ILLINOIS CS 446 - 090517.3 (11 pages)


- Pages: 11
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning

## Lecture 3: Generalization and Bayes Optimal

CS446 Machine Learning, Fall 2017. Lecturer: Sanmi Koyejo. Scribe: Vinitha Ravichandran. Sep 5th, 2017.

### Generalization

The ability of a learned model to make accurate predictions on novel inputs is known as *generalization*. Let $X$ denote the instance space and $Y$ the target space. The data-generating model is a distribution $\mathbb{P}$ such that $(x_i, y_i) \overset{\text{iid}}{\sim} \mathbb{P}$, where $x_i \in X$ and $y_i \in Y$. The training data is a sample $D_n = \{(x_i, y_i)\}_{i=1}^{n}$. Let $h_n$ denote a function that maps the instance space to the target space; $h_n$ depends on the training data.

The generalization error $G_n$ is the expected value of the misclassification rate when averaged over future data. It is defined as the difference between the risk of the classifier at the distribution level and the risk of the classifier on a particular dataset:

$$G_n = R(h_n; \mathbb{P}) - R(h_n; D_n).$$

We say that a classifier generalizes when $G_n \to 0$.

### Evaluating the generalization error

**Approximation.** Since we do not get to see $\mathbb{P}$, we evaluate the generalization error using an approximate generalization error $\hat{G}_n$, defined as the difference between the risk of the classifier on the test set and the risk on the training set:

$$\hat{G}_n = R(h_n; D_{\text{Test}}) - R(h_n; D_{\text{Train}}).$$

**Cross-validation.** An alternative strategy to evaluate the generalization of a classifier is cross-validation. We split the dataset into $k$ folds, i.e. $D_n = D_1 \cup D_2 \cup \dots \cup D_k$, train on all the folds but the $k$-th, and test on the $k$-th, in round-robin fashion. We then compute the generalization error averaged over all the folds. For a dataset draw $D$, the generalization error can be defined by

$$\hat{G}_n = \mathbb{E}_D\left[R(h_n; D_{\text{Test}}) - R(h_n; D_{\text{Train}})\right].$$

**Why do we average?** We average to reduce the variance. The dataset $D_n$ that you see is random, which implies that the classifier $h_n$ we train is also random, since it depends on the dataset. Therefore the generalization error estimate $\hat{G}_n$ is itself a random variable.
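The cross-validation estimate above can be sketched in code. Everything here is illustrative: the helper names are made up, and a trivial majority-label rule stands in for the learning algorithm.

```python
# Sketch of the k-fold estimate of G_n = E_D[ R(h; D_test) - R(h; D_train) ].
# The "training" procedure is a stand-in: it just memorises the majority label.

def train_majority(data):
    """'Train' a trivial classifier that always predicts the majority label."""
    labels = [y for _, y in data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def risk(h, data):
    """Empirical risk: misclassification rate of h on data."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def cv_generalization_error(data, k=5, train=train_majority):
    """Average the test/train risk gap over k round-robin folds."""
    folds = [data[i::k] for i in range(k)]
    gaps = []
    for i in range(k):
        test = folds[i]
        train_set = [p for j, fold in enumerate(folds) if j != i for p in fold]
        h = train(train_set)
        gaps.append(risk(h, test) - risk(h, train_set))
    return sum(gaps) / k
```

Averaging the per-fold gaps is exactly the variance-reduction step discussed above: each fold yields one sample of the (random) generalization error, and the mean summarizes them.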
We obtain a different generalization error each time the classifier is run on a different dataset, so each observed generalization error can be thought of as a sample from the underlying distribution of all generalization errors; the average therefore provides a good summary of what is going on across all datasets.

In the approximation approach, as $D_{\text{Test}}$ gets very large, the sample $\hat{G}_n$ obtained gets closer to the population $G_n$. In cross-validation, however, the estimate depends strongly on the dataset.

**On obtaining a new classifier for every cross-validation split, which one do you pick?** Since each classifier depends on its training split, we are not evaluating a particular classifier but the classification *procedure* itself: we need a procedure whose training accuracy is close to its accuracy on the population. To end up with a single classifier, any of the following can be used:

- Majority voting
- Retraining on the entire dataset
- Picking one of the fold classifiers at random

**Is generalization enough?** No. Even if we train a classifier whose test and training accuracies are close to each other, we cannot claim that we have learned a good model, because both accuracies can be low. Therefore we cannot evaluate a classifier using the generalization error alone.

### Generalization accuracy of an all-positive classifier

Consider a simple example where the input space is one-dimensional. The instance space is $X = [-B, B]$, as shown in Figure 1, with $x$ uniformly distributed, and the label is $y = \text{sign}(x)$. Let $h_n$ be the classifier that always predicts $+1$.

*Figure 1: The instance space X.*

**What is the maximum training accuracy the classifier can get?** On a dataset that happens to be all $+1$'s we would get a perfect training accuracy of 1, but since $y$ is the sign of a uniformly distributed $x$, this is highly unlikely. The training accuracy is

$$\text{TrainingAcc} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbb{1}[y_n = h(x_n) = 1] + \mathbb{1}[y_n = h(x_n) = -1]\right).$$

Since the classifier never predicts $-1$, the second indicator is always zero. Thus

$$\text{TrainingAcc} = \frac{1}{N}\sum_{n=1}^{N}\mathbb{1}[y_n = h(x_n) = 1] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{1}[y_n = 1],$$

which is the proportion of $+1$'s in the dataset.
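The training-accuracy calculation for the all-positive classifier can be checked numerically. A minimal sketch, with a made-up six-point dataset:

```python
def training_accuracy(h, data):
    """Fraction of training points on which h agrees with the label."""
    return sum(1 for x, y in data if h(x) == y) / len(data)

# Labels are y = sign(x); the all-positive classifier ignores x entirely.
sign = lambda x: 1 if x > 0 else -1
h_all_positive = lambda x: 1

data = [(x, sign(x)) for x in [-3.0, -1.0, -0.5, 0.5, 2.0, 4.0]]

# The accuracy of h_all_positive equals the proportion of +1 labels (3/6),
# while sign itself attains the best possible accuracy.
acc_const = training_accuracy(h_all_positive, data)  # 0.5
acc_sign = training_accuracy(sign, data)             # 1.0
```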
The generalization accuracy is the accuracy of the classifier over the entire population:

$$\text{Acc}(h_n; \mathbb{P}) = \text{proportion of } +1\text{'s in } \mathbb{P} = \mathbb{P}(y = 1) = \frac{1}{2},$$

since the distribution is uniform. Hence $h_n \equiv 1$ has good generalization error (its training accuracy concentrates around its population accuracy), even though the classifier itself is poor. Generalization is an asymptotic property: as $n$ gets large, the difference between the accuracy on the entire distribution and the accuracy on the dataset becomes very small. Since the generalization error is estimated by averaging over several samples, where each $D_n$ is sufficiently large, the training accuracy goes to $\frac{1}{2}$ quickly by the central limit theorem (Wikipedia, 2017).

**What is the best possible accuracy, in the population sense, if we are allowed to pick any classifier?** Choosing $h_n = \text{sign}(x)$ gives an accuracy of 1, which is the best possible.

### Goal of supervised machine learning

The goal of supervised machine learning is to find or estimate a learner (an algorithm, a procedure) that makes accurate predictions on new data. In other words, we aim to minimize the risk of the classifier, $R(h_n; \mathbb{P})$. The problem can be decomposed into two parts:

1. Be accurate on the training data, i.e. $R(h_n; D_n)$ should be low.
2. Generalize, i.e. $R(h_n; \mathbb{P}) - R(h_n; D_n) \to 0$.

### Evaluation of the 1-nearest-neighbour (1-NN) classifier

**What is the training accuracy of the 1-NN classifier?** Let us evaluate it in a simple linear setting, as shown in Figure 2. The instance space is $X = [-B, B]$ with $x$ uniformly distributed, so $p(x) = \frac{1}{2B}$, and the labels satisfy $\mathbb{P}(Y = 1 \mid X) = 1 - \mathbb{P}(Y = -1 \mid X)$.

*Figure 2: The probability model for a simple linear setting.*

The situation where the training accuracy is unavailable can never arise, since we can always evaluate $\text{Acc}(h_n; D_n)$. For a given sample draw we can plot the values of $y$ on a binary axis, as shown in Figure 3. The nearest neighbour of any training point is the point itself, so the prediction made by the 1-NN classifier on the training set is always accurate. Therefore

$$\text{Acc}(1\text{-NN}; D_n) = 1.$$
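The claim that $\text{Acc}(1\text{-NN}; D_n) = 1$ follows because each training point is its own nearest neighbour (at distance 0). A minimal sketch, assuming no two training points share the same $x$; the data and helper name are illustrative:

```python
def nn_predict(train_data, x):
    """1-NN prediction: copy the label of the closest training point."""
    nearest_x, nearest_y = min(train_data, key=lambda p: abs(p[0] - x))
    return nearest_y

# A noisy draw: the label at x = 0.3 disagrees with sign(x).
train_data = [(-2.0, -1), (-1.0, -1), (0.3, -1), (1.0, 1), (2.0, 1)]

# Every training point is its own nearest neighbour, so 1-NN simply returns
# the memorised label and the training accuracy is 1 even under label noise.
train_acc = sum(1 for x, y in train_data
                if nn_predict(train_data, x) == y) / len(train_data)  # 1.0
```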
**What is the generalization accuracy, i.e. $\text{Acc}(1\text{-NN}; \mathbb{P})$, of this classifier?** Assume we have a new test sample $D_{\text{Test}}$, for which we plot the values of $Y$ as shown in Figure 4.

*Figure 3: The values of Y for a given sample draw.*

We can observe that the classifier trained using the sample draw of Figure 3 will not make accurate predictions on the test dataset. Had we used $D_{\text{Test}}$ to train the …
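The gap between the 1-NN training and test accuracy can be illustrated with a deliberately noisy training draw in the spirit of Figures 3 and 4; the datasets below are made up for the example.

```python
def nn_predict(train_data, x):
    """1-NN prediction: copy the label of the closest training point."""
    return min(train_data, key=lambda p: abs(p[0] - x))[1]

def accuracy(train_data, eval_data):
    return sum(1 for x, y in eval_data
               if nn_predict(train_data, x) == y) / len(eval_data)

# Training draw with one mislabelled point at x = 0.3 (as in Figure 3).
train_data = [(-2.0, -1), (-1.0, -1), (0.3, -1), (1.0, 1), (2.0, 1)]
# A fresh test draw labelled by sign(x) (as in Figure 4).
test_data = [(-1.5, -1), (0.4, 1), (1.5, 1)]

train_acc = accuracy(train_data, train_data)  # 1.0: each point is its own neighbour
test_acc = accuracy(train_data, test_data)    # 2/3: the memorised noise misleads x = 0.4
```

The memorised label noise is invisible on the training set but costs accuracy on new draws, which is exactly why training accuracy alone does not tell us how good the classifier is.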
