# ILLINOIS CS 446 - 090517.3 (11 pages)


- Pages: 11
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS446 Machine Learning, Fall 2017. Lecture 3: Generalization and Bayes Optimal. Lecturer: Sanmi Koyejo. Scribe: Vinitha Ravichandran. Sep 5th, 2017.

## Generalization

The ability of a learned model to make accurate predictions on novel inputs is known as *generalization*. Let $\mathcal{X}$ denote the instance space and $\mathcal{Y}$ denote the target space. The data-generating model is a distribution $P \in \mathcal{P}$ such that $(x_i, y_i) \overset{\text{iid}}{\sim} P$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The training data is a sample $D_n = \{(x_i, y_i)\}_{i=0}^{n}$. Let $h_n$ denote a function that maps the instance space to the target space; $h_n$ depends on the training data.

**Generalization error.** The generalization error $G_n$ is the expected misclassification rate averaged over future data. It is defined as the difference between the risk of the classifier at the distribution level and the risk of the classifier on a particular dataset:

$$G_n = R(h_n; P) - R(h_n; D_n)$$

We say that a classifier generalizes when $G_n \to 0$.

## Evaluation of generalization error

**Approximation.** Since we don't get to see $P$, we evaluate the generalization error using an approximate generalization error $\hat{G}_n$, defined as the difference between the risk of the classifier on the test set and its risk on the training set:

$$\hat{G}_n = R(h_n; D_{Test}) - R(h_n; D_{Train})$$

**Cross-validation.** An alternative strategy for evaluating the generalization of a classifier is cross-validation. In cross-validation we split the dataset into $k$ folds, i.e. $D_n = D_1 \cup D_2 \cup \dots \cup D_k$. We train on all the folds but the $k$-th and test on the $k$-th, in round-robin fashion, then compute the generalization error averaged over all the folds. For a dataset draw $D$, the generalization error can be defined by

$$\hat{G}_n = \mathbb{E}_D\left[R(h_n; D_{Test}) - R(h_n; D_{Train})\right]$$

**Why do we average?** We average to reduce variance. The dataset $D_n$ that you see is random, which implies that the trained classifier $h_n$ is also random, since it depends on the dataset. Therefore the generalization error estimate $\hat{G}_n$ is a random variable: we obtain a different generalization error each time the classifier is run over a different draw of the data.
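The split-sample estimate $\hat{G}_n$ above can be sketched in code. This is not from the lecture; it is a minimal illustration using a hypothetical toy learner (`fit_threshold`, a one-dimensional threshold rule) and synthetic noisy data, with the 0-1 loss as the risk.

```python
import random

def zero_one_risk(h, data):
    """Empirical 0-1 risk R(h; D): fraction of points h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)

def fit_threshold(train):
    """Toy learner (hypothetical, for illustration only): pick the
    threshold t minimizing training risk for the rule h(x) = 1[x >= t]."""
    candidates = sorted({x for x, _ in train})
    best_t = min(candidates,
                 key=lambda t: zero_one_risk(lambda x: int(x >= t), train))
    return lambda x: int(x >= best_t)

def generalization_gap(train, test):
    """Approximate generalization error:
    G_hat = R(h_n; D_test) - R(h_n; D_train)."""
    h = fit_threshold(train)
    return zero_one_risk(h, test) - zero_one_risk(h, train)

# Synthetic data: true label is 1[x >= 0.5], flipped with probability 0.1.
rng = random.Random(0)
def draw(n):
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x >= 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

gap = generalization_gap(draw(200), draw(200))
```

Note that `gap` can be negative: the test risk of a particular trained classifier may, by chance, be lower than its training risk, even though on average the training risk is optimistically biased.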
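The $k$-fold procedure above can likewise be sketched as follows. Again this is an illustrative assumption, not the course's reference code: `majority_learner` is a hypothetical stand-in learner that predicts the most common training label, and folds are formed by simple round-robin slicing.

```python
import random
from collections import Counter

def zero_one_risk(h, data):
    """Empirical 0-1 risk R(h; D)."""
    return sum(h(x) != y for x, y in data) / len(data)

def majority_learner(train):
    """Hypothetical toy learner: always predict the majority training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def kfold_gap(data, k, learner=majority_learner):
    """Average the train/test risk gap over k round-robin folds:
    fold j is held out for testing; the rest form the training set."""
    folds = [data[j::k] for j in range(k)]
    gaps = []
    for j in range(k):
        test = folds[j]
        train = [pt for i, f in enumerate(folds) if i != j for pt in f]
        h = learner(train)
        gaps.append(zero_one_risk(h, test) - zero_one_risk(h, train))
    return sum(gaps) / k

rng = random.Random(1)
data = [(rng.random(), rng.randint(0, 1)) for _ in range(100)]
est = kfold_gap(data, k=5)
```

Averaging over the $k$ folds is exactly the variance-reduction step discussed above: each fold yields one random draw of the gap, and their mean is a lower-variance estimate than any single split.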
