## 090517.2



- Pages: 4
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS446 Machine Learning, Fall 2017. Lecture 3: Bias-Variance Tradeoff, Overfitting, Bayes Optimal, Probabilistic Models, Naive Bayes. Lecturer: Sanmi Koyejo. Scribe: Xinchen Pan. Sep 5th, 2017.

## 1 Bias-Variance Trade-off

Suppose we have a data matrix $x$ and a set of values $y$ associated with it, modeled as $y = f(x) + \epsilon$, where $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. Because of the noise term, the predictions we get will not be exactly the same as the true values. Denote the prediction by $\hat{f}(x)$ and define the mean squared error to be $E[(y - \hat{f}(x))^2]$. We decompose it (using $E[y\hat{f}(x)] = E[y]\,E[\hat{f}(x)]$, since the noise in $y$ is independent of $\hat{f}(x)$):

$$
\begin{aligned}
E[(y - \hat{f}(x))^2] &= E[y^2 + \hat{f}(x)^2 - 2y\hat{f}(x)] \\
&= E[y^2] + E[\hat{f}(x)^2] - 2E[y]\,E[\hat{f}(x)] \\
&= \mathrm{Var}(y) + E[y]^2 + \mathrm{Var}(\hat{f}(x)) + E[\hat{f}(x)]^2 - 2E[y]\,E[\hat{f}(x)] \\
&= \mathrm{Var}(y) + \mathrm{Var}(\hat{f}(x)) + \big(E[y] - E[\hat{f}(x)]\big)^2 \\
&= \sigma^2 + \text{Variance} + \text{Bias}^2.
\end{aligned}
$$

Thus

$$
\text{Variance} = E[\hat{f}(x)^2] - E[\hat{f}(x)]^2, \qquad \text{Bias} = E[y] - E[\hat{f}(x)]. \tag{1}
$$

**Example.** For a binary classification problem, let all predictions be $1$, i.e. $\hat{f}(x) = 1$. Then we will have a variance of $0$, but a large bias, because such predictions are no better than a random guess and cannot be accurate.

## 2 Overfitting and Underfitting

### 2.1 Overfitting

Overfitting occurs when we have low training error but high testing error; in other words, the learning algorithm does not generalize. Denote by $h_n$ the learned hypothesis, $D_{\text{train}}$ the training dataset, $D_{\text{test}}$ the testing dataset, and $R$ a risk function. When overfitting, we have

$$R(h_n, D_{\text{train}}) \ll R(h_n, D_{\text{test}}),$$

i.e. the training error is much smaller than the testing error.

#### 2.1.1 Example

Consider $k$-nearest-neighbour with $k = 1$. We will have a training error of $0$, because the nearest neighbour of any training point is the point itself. On the testing set, however, the nearest neighbour of a test point will not be itself, since test points are not among the points used for computing distances.

### 2.2 Underfitting

Underfitting occurs when we have both high training error and high testing error. It means that the learning algorithm cannot represent the
data well. Using the same notation, we have

$$R(h_n, D_{\text{train}}) \approx R(h_n, D_{\text{test}}),$$

with both errors large. It is very rare that $R(h_n, D_{\text{train}}) > R(h_n, D_{\text{test}})$.

#### 2.2.1 Example

A simple example: using a constant number to predict data that has a clear curvature.

(Figure 1: Underfitting example.)

Clearly, we expect poor performance.

## 3 Bayes Optimal

Given the true model $p$, we seek a function $f$ that minimizes the risk between the predicted values and the true values. We call this function the Bayes optimal predictor:

$$f^* = \arg\min_{f \in \mathcal{F}} R(f; p).$$

## 4 Probabilistic Models: Naive Bayes

Naive Bayes is a classifier that can be used for both binary and multiclass classification. The Naive Bayes algorithm assumes that the features are conditionally independent of each other given the class.

### 4.1 Bayes Formula

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

- $p(y \mid x)$ is the posterior probability, the quantity we want to compute.
- $p(x \mid y)$ is the likelihood, which we estimate from the training data.
- $p(y)$ is the prior probability of each class $y$.
- $p(x)$ is the marginal probability (evidence) of $x$.

Because of the independence assumption, we can calculate $p(y \mid x)$ via

$$p(y \mid x) \propto p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_n \mid y)\, p(y).$$

Furthermore, we assume that the values of $x$ associated with each class follow some distribution; a typical choice is the normal distribution:

$$p(x \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right).$$

We then compare the resulting probabilities across all the classes and return the class with the largest probability.

### 4.2 Pros and Cons

**Pros:** It is fast and can usually give a good prediction.

**Cons:** The independence assumption does not usually hold in real life. We also need a method (such as Laplace smoothing) to deal with the zero-probability situation, i.e. when a feature value appears in the testing data but never appears in the training data.
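The Gaussian Naive Bayes procedure of Section 4 can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation; the class name, the toy dataset, and the small variance floor are my own choices:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes: one Gaussian per (class, feature)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for xi, yi in zip(X, y):
            groups[yi].append(xi)
        n = len(X)
        self.params = {}
        for c, rows in groups.items():
            prior = len(rows) / n
            stats = []
            for feature in zip(*rows):  # iterate over feature columns
                mu = sum(feature) / len(feature)
                var = sum((v - mu) ** 2 for v in feature) / len(feature)
                stats.append((mu, max(var, 1e-9)))  # floor avoids zero variance
            self.params[c] = (prior, stats)
        return self

    def predict(self, x):
        def log_gauss(v, mu, var):
            return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        best, best_score = None, -math.inf
        for c, (prior, stats) in self.params.items():
            # log p(y) + sum_j log p(x_j | y), by the independence assumption
            score = math.log(prior) + sum(
                log_gauss(v, mu, var) for v, (mu, var) in zip(x, stats))
            if score > best_score:
                best, best_score = c, score
        return best

# Two well-separated 2-D classes
X = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (4.8, 5.2), (5.2, 4.9)]
y = [0, 0, 0, 1, 1, 1]
model = GaussianNB().fit(X, y)
print(model.predict((0.1, 0.0)))  # point near the origin -> class 0
print(model.predict((5.0, 5.0)))  # point near (5, 5)    -> class 1
```

Scoring in log space avoids numerical underflow when multiplying many small per-feature likelihoods, which is the standard trick for Naive Bayes in practice.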
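The 1-nearest-neighbour overfitting argument from Section 2.1.1 can also be checked numerically. A minimal sketch, assuming a toy dataset with one deliberately mislabeled training point (all names and data here are invented for illustration):

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """1-NN: return the label of the closest training point."""
    dists = [math.dist(x, xi) for xi in train_X]
    return train_y[dists.index(min(dists))]

# Label is 1 iff the coordinate is positive, except one mislabeled noise point
train_X = [(-2.0,), (-1.0,), (-0.5,), (0.5,), (1.0,), (0.6,)]
train_y = [0, 0, 0, 1, 1, 0]          # (0.6,) is mislabeled noise
test_X  = [(-1.4,), (0.7,), (0.52,)]
test_y  = [0, 1, 1]

train_err = sum(nearest_neighbor_predict(train_X, train_y, x) != y
                for x, y in zip(train_X, train_y)) / len(train_X)
test_err  = sum(nearest_neighbor_predict(train_X, train_y, x) != y
                for x, y in zip(test_X, test_y)) / len(test_X)
print(train_err)  # 0.0: each training point is its own nearest neighbour
print(test_err)   # nonzero: the mislabeled point misclassifies nearby test points
```

As the notes predict, the training error is exactly zero regardless of the noise, while the test error is not, so the gap $R(h_n, D_{\text{train}}) \ll R(h_n, D_{\text{test}})$ appears even on this tiny dataset.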
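Finally, the decomposition in Section 1 can be verified by Monte Carlo simulation. In this sketch, the true value $f(x) = 2.0$, the noise level, the shrinkage factor $0.8$ (which makes the estimator deliberately biased), and the trial counts are all my own choices, not from the lecture:

```python
import random

random.seed(0)

f_x = 2.0        # true value f(x) at a fixed input x
sigma = 0.5      # noise std dev: y = f(x) + eps, eps ~ N(0, sigma^2)
n_train = 5      # observations used to build each estimate
trials = 200_000

preds, sq_errs = [], []
for _ in range(trials):
    # One estimate f_hat(x): a shrunk sample mean (deliberately biased)
    sample = [f_x + random.gauss(0, sigma) for _ in range(n_train)]
    f_hat = 0.8 * (sum(sample) / n_train)
    # A fresh noisy observation y to score against
    y = f_x + random.gauss(0, sigma)
    preds.append(f_hat)
    sq_errs.append((y - f_hat) ** 2)

mse = sum(sq_errs) / trials
mean_pred = sum(preds) / trials
variance = sum(p * p for p in preds) / trials - mean_pred ** 2
bias = f_x - mean_pred   # E[y] = f(x), so Bias = E[y] - E[f_hat(x)]
print(mse)
print(sigma ** 2 + variance + bias ** 2)  # should be close to mse
```

Up to Monte Carlo error, the measured mean squared error matches $\sigma^2 + \text{Variance} + \text{Bias}^2$, with the bias here close to $(1 - 0.8) \cdot f(x) = 0.4$ by construction.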
