## 090517.2



- Pages: 4
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS446 Machine Learning, Fall 2017. Lecture 3: Bias-Variance Tradeoff, Overfitting, Bayes Optimal, Probabilistic Models, Naive Bayes. Lecturer: Sanmi Koyejo. Scribe: Xinchen Pan. Sep 5th, 2017.

## 1 Bias-Variance Trade-off

Suppose we have a data matrix $x$ and a set of values $y$ associated with it, modeled as $y = f(x) + \epsilon$, where $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. Because of the noise term, the predictions we get will not be exactly the same as the true values. Denote the prediction by $\hat{f}(x)$ and define the mean squared error to be $E[(y - \hat{f}(x))^2]$. We decompose it (using $E[y\hat{f}(x)] = E[y]\,E[\hat{f}(x)]$, since the noise in $y$ is independent of $\hat{f}(x)$):

$$
\begin{aligned}
E[(y - \hat{f}(x))^2] &= E[y^2 + \hat{f}(x)^2 - 2y\hat{f}(x)] \\
&= E[y^2] + E[\hat{f}(x)^2] - 2E[y]\,E[\hat{f}(x)] \\
&= \mathrm{Var}(y) + E[y]^2 + \mathrm{Var}(\hat{f}(x)) + E[\hat{f}(x)]^2 - 2E[y]\,E[\hat{f}(x)] \\
&= \mathrm{Var}(y) + \mathrm{Var}(\hat{f}(x)) + \big(E[y] - E[\hat{f}(x)]\big)^2 \\
&= \sigma^2 + \text{Variance} + \text{Bias}^2.
\end{aligned}
$$

Thus

$$
\text{Variance} = E[\hat{f}(x)^2] - E[\hat{f}(x)]^2, \qquad \text{Bias} = E[y] - E[\hat{f}(x)]. \tag{1}
$$

**Example.** For a binary classification problem, let all predictions be $1$, i.e. $\hat{f}(x) = 1$. Then we will have a variance of $0$, but a large bias, because such predictions are no better than a random guess and cannot be accurate.

## 2 Overfitting and Underfitting

### 2.1 Overfitting

Overfitting occurs when we have low training error but high testing error; in other words, the learning algorithm does not generalize. Denote by $h_n$ the learned hypothesis, $D_{\text{train}}$ the training dataset, $D_{\text{test}}$ the testing dataset, and $R$ a risk function. When overfitting, we have

$$R(h_n, D_{\text{train}}) \ll R(h_n, D_{\text{test}}),$$

i.e. the training error is much smaller than the testing error.

#### 2.1.1 Example

Consider $k$-nearest-neighbour with $k = 1$. We will have a training error of $0$, because the nearest neighbour of any training point is the point itself. On the testing set, however, the nearest neighbour of a test point will not be itself, since test points are not among the points used for computing distances.

### 2.2 Underfitting

Underfitting occurs when we have both high training error and high testing error. It means that the learning algorithm cannot represent the
data well. Using the same notation, we have

$$R(h_n, D_{\text{train}}) \approx R(h_n, D_{\text{test}}),$$

with both errors large. It is very rare that $R(h_n, D_{\text{train}}) > R(h_n, D_{\text{test}})$.

#### 2.2.1 Example

A simple example: using a constant number to predict data that has a clear curvature.

(Figure 1: Underfitting example.)

Clearly, we expect poor performance.

## 3 Bayes Optimal

Given the true model $p$, we seek a function $f$ that minimizes the risk between the predicted values and the true values. We call this function the Bayes optimal predictor:

$$f^* = \arg\min_{f \in \mathcal{F}} R(f; p).$$

## 4 Probabilistic Models: Naive Bayes

Naive Bayes is a classifier that can be used for both binary and multiclass classification. The Naive Bayes algorithm assumes that the features are conditionally independent of each other given the class.

### 4.1 Bayes Formula

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

- $p(y \mid x)$ is the posterior probability, the quantity we want to compute.
- $p(x \mid y)$ is the likelihood, which we estimate from the training data.
- $p(y)$ is the prior probability of each class $y$.
- $p(x)$ is the marginal probability (evidence) of $x$.

Because of the independence assumption, we can calculate $p(y \mid x)$ via

$$p(y \mid x) \propto p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_n \mid y)\, p(y).$$

Furthermore, we assume that the values of $x$ associated with each class follow some distribution; a typical choice is the normal distribution:

$$p(x \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right).$$

We then compare the resulting probabilities across all the classes and return the class with the largest probability.

### 4.2 Pros and Cons

**Pros:** It is fast and can usually give a good prediction.

**Cons:** The independence assumption does not usually hold in real life. We also need a method (such as Laplace smoothing) to deal with the zero-probability situation, i.e. when a feature value appears in the testing data but never appears in the training data.
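The Gaussian Naive Bayes procedure of Section 4 can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation; the class name, the toy dataset, and the small variance floor are my own choices:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes: one Gaussian per (class, feature)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for xi, yi in zip(X, y):
            groups[yi].append(xi)
        n = len(X)
        self.params = {}
        for c, rows in groups.items():
            prior = len(rows) / n
            stats = []
            for feature in zip(*rows):  # iterate over feature columns
                mu = sum(feature) / len(feature)
                var = sum((v - mu) ** 2 for v in feature) / len(feature)
                stats.append((mu, max(var, 1e-9)))  # floor avoids zero variance
            self.params[c] = (prior, stats)
        return self

    def predict(self, x):
        def log_gauss(v, mu, var):
            return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        best, best_score = None, -math.inf
        for c, (prior, stats) in self.params.items():
            # log p(y) + sum_j log p(x_j | y), by the independence assumption
            score = math.log(prior) + sum(
                log_gauss(v, mu, var) for v, (mu, var) in zip(x, stats))
            if score > best_score:
                best, best_score = c, score
        return best

# Two well-separated 2-D classes
X = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (4.8, 5.2), (5.2, 4.9)]
y = [0, 0, 0, 1, 1, 1]
model = GaussianNB().fit(X, y)
print(model.predict((0.1, 0.0)))  # point near the origin -> class 0
print(model.predict((5.0, 5.0)))  # point near (5, 5)    -> class 1
```

Scoring in log space avoids numerical underflow when multiplying many small per-feature likelihoods, which is the standard trick for Naive Bayes in practice.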
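The 1-nearest-neighbour overfitting argument from Section 2.1.1 can also be checked numerically. A minimal sketch, assuming a toy dataset with one deliberately mislabeled training point (all names and data here are invented for illustration):

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """1-NN: return the label of the closest training point."""
    dists = [math.dist(x, xi) for xi in train_X]
    return train_y[dists.index(min(dists))]

# Label is 1 iff the coordinate is positive, except one mislabeled noise point
train_X = [(-2.0,), (-1.0,), (-0.5,), (0.5,), (1.0,), (0.6,)]
train_y = [0, 0, 0, 1, 1, 0]          # (0.6,) is mislabeled noise
test_X  = [(-1.4,), (0.7,), (0.52,)]
test_y  = [0, 1, 1]

train_err = sum(nearest_neighbor_predict(train_X, train_y, x) != y
                for x, y in zip(train_X, train_y)) / len(train_X)
test_err  = sum(nearest_neighbor_predict(train_X, train_y, x) != y
                for x, y in zip(test_X, test_y)) / len(test_X)
print(train_err)  # 0.0: each training point is its own nearest neighbour
print(test_err)   # nonzero: the mislabeled point misclassifies nearby test points
```

As the notes predict, the training error is exactly zero regardless of the noise, while the test error is not, so the gap $R(h_n, D_{\text{train}}) \ll R(h_n, D_{\text{test}})$ appears even on this tiny dataset.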
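Finally, the decomposition in Section 1 can be verified by Monte Carlo simulation. In this sketch, the true value $f(x) = 2.0$, the noise level, the shrinkage factor $0.8$ (which makes the estimator deliberately biased), and the trial counts are all my own choices, not from the lecture:

```python
import random

random.seed(0)

f_x = 2.0        # true value f(x) at a fixed input x
sigma = 0.5      # noise std dev: y = f(x) + eps, eps ~ N(0, sigma^2)
n_train = 5      # observations used to build each estimate
trials = 200_000

preds, sq_errs = [], []
for _ in range(trials):
    # One estimate f_hat(x): a shrunk sample mean (deliberately biased)
    sample = [f_x + random.gauss(0, sigma) for _ in range(n_train)]
    f_hat = 0.8 * (sum(sample) / n_train)
    # A fresh noisy observation y to score against
    y = f_x + random.gauss(0, sigma)
    preds.append(f_hat)
    sq_errs.append((y - f_hat) ** 2)

mse = sum(sq_errs) / trials
mean_pred = sum(preds) / trials
variance = sum(p * p for p in preds) / trials - mean_pred ** 2
bias = f_x - mean_pred   # E[y] = f(x), so Bias = E[y] - E[f_hat(x)]
print(mse)
print(sigma ** 2 + variance + bias ** 2)  # should be close to mse
```

Up to Monte Carlo error, the measured mean squared error matches $\sigma^2 + \text{Variance} + \text{Bias}^2$, with the bias here close to $(1 - 0.8) \cdot f(x) = 0.4$ by construction.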
