CS446: Machine Learning, Fall 2017
Lecture 3: Overfitting, Bayes Optimal, and Naive Bayes
Lecturer: Sanmi Koyejo          Scribe: Shirdon Gorse, Sep. 5th, 2017

Today's Objectives

• Generalization
• Bayes Optimal Classifier
• Naive Bayes

Generalization

Generalization Error

• Generalization is the ability of a learned model or classifier to fit unseen data/instances.

• Generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.

• Suppose we have a classifier $h_n$ that maps an instance $x$ to a label $y$. The classifier $h_n$ depends on the training set $D_n = \{(x_i, y_i)\}_{i=1}^{n}$, where each $(x_i, y_i)$ is drawn from a generating distribution $P$. The generalization error is
  $$G_n = R(h_n, P) - R(h_n, D_n),$$
  that is, the risk of the classifier $h_n$ over the generating distribution $P$ minus the risk of $h_n$ over the data set $D_n$.

• As $G_n$ goes to 0, we say that the classifier $h_n$ "generalizes".

• In practice we never see the infinite distribution $P$, so there are two tricks to approximate $R(h_n, P)$:

  – Train/test split, where we evaluate the "out of sample" performance:
    $$\hat{G}_n = R(h_n, D_{\mathrm{TEST}}) - R(h_n, D_{\mathrm{TRAIN}}).$$

  – Cross-validation:
    ∗ The goal is to hold out part of the data to "test" the model during the training phase, in order to limit problems like overfitting and to give insight into how the model will generalize to an independent dataset.
    ∗ $\hat{G}_n = E_D[R(h_n, \mathrm{TestingDataset}) - R(h_n, \mathrm{TrainingDataset})]$, where $E_D$ denotes the average over all of the held-out splits.
    ∗ The reason we average is to reduce the variance of the estimate: $D_n$ is random, $h_n$ is random since it depends on $D_n$, and $G_n$ is therefore a random variable.
    ∗ Since you get a new classifier $h_n$ for every cross-validation split, which one do you pick?
      · Majority vote
      · Re-train on the entire data set
      · Just pick one

Is Generalization Enough?

• No, but why is it not good enough?

• Example: Suppose the random variable $X$ lies in the range $[-B, B]$, the label is $y = \mathrm{sign}(X)$, and $X$ is uniformly distributed across $[-B, B]$. Consider the training accuracy of the constant classifier $h_n = 1$:
  $$\mathrm{TrainingAccuracy} = \frac{1}{N}\sum_{n=1}^{N}\Big(1[y_n = h(X_n) = 1] + 1[y_n = h(X_n) = -1]\Big) = \frac{1}{N}\sum_{n=1}^{N} 1[y_n = h(X_n) = 1] = \frac{1}{N}\sum_{n=1}^{N} 1[y_n = 1]$$
  (the second indicator vanishes because $h$ never predicts $-1$).

• With that logic, the training accuracy equals the proportion of 1's in our data.

• The generalization accuracy is $\mathrm{Acc}(h_n, P) = P(y = 1) = 1/2$.

• Our classifier $h_n = 1$ has good generalization error, since $\mathrm{Acc}(h_n, P) - \mathrm{Acc}(h_n, D_n)$ gets small.

• But we know the best possible accuracy is 1, achieved by setting the classifier to $h(X) = \mathrm{sign}(X)$.
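The example above is easy to check numerically. Below is a minimal sketch (not from the lecture; the NumPy setup and helper name `accuracy` are my own, and the gap is measured in accuracy rather than risk) that estimates the train/test gap $\hat{G}_n$ for the constant classifier and compares it against $\mathrm{sign}(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N = 1.0, 10_000

# X ~ Uniform[-B, B] with labels y = sign(X), as in the example above.
X = rng.uniform(-B, B, size=N)
y = np.sign(X)

def accuracy(h, X, y):
    """Empirical accuracy of classifier h on the dataset (X, y)."""
    return np.mean(h(X) == y)

# Simple train/test split to estimate the generalization gap.
X_tr, X_te = X[: N // 2], X[N // 2:]
y_tr, y_te = y[: N // 2], y[N // 2:]

h_const = lambda X: np.ones_like(X)  # the constant classifier h_n = 1
h_sign = lambda X: np.sign(X)        # the best possible classifier

gap = accuracy(h_const, X_tr, y_tr) - accuracy(h_const, X_te, y_te)
print(f"constant: train acc = {accuracy(h_const, X_tr, y_tr):.3f}, "
      f"estimated gap = {gap:+.3f}")                                # both accs ~ 0.5, gap ~ 0
print(f"sign:     test  acc = {accuracy(h_sign, X_te, y_te):.3f}")  # = 1.0
```

Both train and test accuracy of the constant classifier come out near 1/2, so its estimated gap is near zero even though it is far from the best achievable accuracy: generalizing well is necessary but not sufficient.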
Goal of Supervised ML

• The goal of supervised machine learning is to find a learner/algorithm that makes accurate predictions on new data, i.e., one that is able to generalize. We want the risk of the classifier $h_n$ on the population $P$, $R(h_n, P)$, to be small. For this, the following statements should both hold:

  – Accurate fit to the training data, so that the risk of the classifier $h_n$ on the data set $D_n$, $R(h_n, D_n)$, is small.
  – The generalization error of the classifier $h_n$ approaches 0: $|R(h_n, P) - R(h_n, D_n)| \to 0$.

1-Nearest-Neighbor Classifier

• The training accuracy of the 1-nearest-neighbor classifier is 1, since the nearest neighbor to any training input is the input itself.

• The training accuracy is not somewhere between 0 and 1: the set the classifier searches for neighbors is exactly the training set, so there is no way for the 1-nearest-neighbor classifier to choose an incorrect label on a training point.

• Example: Suppose the random variable $X$ lies in the range $[-B, B]$, and suppose $P(y = 1 \mid X) = \frac{X + B}{2B}$. By the logic above, the accuracy of the 1-NN classifier $h_n$ on the data set $D_n$ is 1. Its generalization, however, is usually not very good. One way to make the nearest-neighbor classifier generalize better is to increase the number of nearest neighbors consulted when the classifier predicts the label for a new input.

• The 1-NN classifier also changes a lot across training data sets, i.e., it has "high variance".

Which Classifier $h_n$ Maximizes $\mathrm{Acc}(h_n, P)$ for the Example Above?

• $F$: the set of all possible functions.

• We want to find the classifier function $f$ within $F$ that maximizes the accuracy on the given population distribution, $\mathrm{Acc}(f, P)$.

• This means we want to maximize the average of the indicator function of $y = f(x)$:
  $$\max_{f} E_P\big[1[y = f(x)]\big] = \max_{f} E_x\Big[E_{y|x}\big[1[y = f(x)]\big]\Big].$$

• Since $f$ ranges over all functions, we can optimize pointwise: for each $x$, set $f(x) = \hat{y}_x$ and solve
  $$\max_{\hat{y}_x} E_{y|x}\big[1[y = \hat{y}_x]\big] = \max_{\hat{y}_x}\Big(P(y = 1 \mid x)\,1[\hat{y}_x = 1] + P(y = -1 \mid x)\,1[\hat{y}_x = -1]\Big),$$
  using the fact that $1[y = 1] \cdot 1[\hat{y}_x = 1] = 1[y = \hat{y}_x = 1]$.

Figure 1: The optimal classifier $f^*$ predicts 1 if $x > 0$ and predicts $-1$ if $x < 0$.

• So the optimal choice is $\hat{y}_x = 1$ if $P(y = 1 \mid x) > P(y = -1 \mid x)$, and $-1$ otherwise. Since $P(y = -1 \mid x) = 1 - P(y = 1 \mid x)$,
  $$P(y = 1 \mid x) > 1 - P(y = 1 \mid x) \iff 2P(y = 1 \mid x) > 1 \iff P(y = 1 \mid x) > \tfrac{1}{2}.$$
  $$\text{Therefore } f(x) = \mathrm{sign}\big(P(y = 1 \mid x) - \tfrac{1}{2}\big).$$

Bayes Optimal Classifier

• Bayes Theorem

  – The theorem provides a method to calculate the probability of a hypothesis using prior knowledge of its probability, $P(h)$, the probability of observing the data when the hypothesis holds, $P(D \mid h)$, and the probability of the observed data, $P(D)$:
    $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

  – Bayes Optimal Hypothesis
    ∗ In most cases there are several candidate hypotheses in the set $H$. We want to find the most probable hypothesis, i.e., the function $f^*$ such that
      $$f^* = \arg\max_{f} \mathrm{Acc}(f, P).$$
    ∗ $\mathrm{Acc}(f^*, P)$ is called the Bayes optimal accuracy.
    ∗ Correspondingly, $R(f^*, P)$ is the Bayes optimal risk.
    ∗ $f^* = \arg\min_{f} R(f, P)$, where $f^*$ is the Bayes optimal classifier.

  – The Bayes optimal classifier depends on the choice of risk $R$. Suppose, for example, that the risk is recall:
    $$\mathrm{Recall}(h_n, D_n) = \frac{\sum_{n=1}^{N} 1[y_n = h(x_n) = 1]}{\sum_{n=1}^{N} 1[y_n = 1]}$$

Figure 2: A graphical model explaining the difference between $f^*$ and $h_n$.

• Figure 2 shows the difference between $f^*$ and $h_n$.

• The difference between the optimal function $f^*$ in the set of all functions $F$ and the optimal function $h^*$ in the hypothesis set $H$ is called the representation error: $\text{Representation Error} = R(h^*, P) - R(f^*, P)$.

• The difference between the error of $h^*$ and the error of the hypothesis you actually chose is called the optimization error: $\text{Optimization Error} = R(h_n, D_n) - R(h^*, P)$.

• The difference between the error on the data set $D_n$ and the error on the infinite population $P$ is called the generalization error, $R(h_n, P) - R(h_n, D_n)$, as defined at the start of the lecture.
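To make this concrete, here is a small sketch (my own illustration, not from the notes; the helper names `sample` and `one_nn` are invented) of the running example with noisy labels $P(y = 1 \mid x) = (x + B)/(2B)$. It shows 1-NN memorizing the training set (training accuracy exactly 1) while the Bayes optimal rule $f^*(x) = \mathrm{sign}(P(y = 1 \mid x) - 1/2) = \mathrm{sign}(x)$ does better on fresh data:

```python
import numpy as np

rng = np.random.default_rng(1)
B, N = 1.0, 2_000

def sample(n):
    """Draw (X, y) with X ~ Uniform[-B, B] and P(y = 1 | x) = (x + B) / (2B)."""
    X = rng.uniform(-B, B, size=n)
    y = np.where(rng.random(n) < (X + B) / (2 * B), 1, -1)
    return X, y

X_train, y_train = sample(N)
X_test, y_test = sample(N)

def one_nn(X_query):
    """1-NN prediction: copy the label of the closest training point."""
    idx = np.abs(X_query[:, None] - X_train[None, :]).argmin(axis=1)
    return y_train[idx]

f_star = lambda X: np.sign(X)  # Bayes optimal rule: sign(P(y=1|x) - 1/2) = sign(x)

print("1-NN  train accuracy:", np.mean(one_nn(X_train) == y_train))  # exactly 1.0
print("1-NN  test  accuracy:", np.mean(one_nn(X_test) == y_test))    # ~ 2/3
print("Bayes test  accuracy:", np.mean(f_star(X_test) == y_test))    # ~ 3/4, the Bayes optimal accuracy
```

The gap between 1-NN's perfect training accuracy and its test accuracy is the generalization error from the decomposition above, while the gap between 1-NN's test accuracy and the Bayes optimal accuracy of about 3/4 reflects how far the learned $h_n$ is from $f^*$.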

