# ILLINOIS CS 446 - 090717.2 (6 pages)

- Pages: 6
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning

**CS 446 Machine Learning, Fall 2017. Lecture 4: Overfitting, Naive Bayes, Maximum Likelihood.**
Lecturer: Sanmi Koyejo. Scribe: Liqun Lu. September 7th, 2017.

## Review of generalization and Bayes optimal

### Generalization

The goal of generalization is to find a function (algorithm) that has good prediction accuracy (performance) on new data, i.e. the risk $R(h_n, D_{\text{test}})$, where $D_{\text{test}} \neq D_{\text{train}}$, satisfies

$$R(h_n, D_{\text{test}}) \approx R(h_n, P).$$

### Bayes optimal

The Bayes optimal classifier is defined as

$$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} R(f, P).$$

The accuracy (error) is determined by two parts (see Figure 1): (a) the representation error, and (b) the statistical error plus the optimization error.

*Figure 1: Representation error and statistical error.*

## Overfitting / Underfitting

### Overfitting

Overfitting means a function $h_n$ has good training performance but bad test performance, i.e. $h_n$ does not generalize:

$$R(h_n, D_{\text{train}}) \ll R(h_n, D_{\text{test}}).$$

Generally, overfitting implies that the hypothesis class $\mathcal{H}$ is too big. This means the functional form one can choose is too flexible (e.g. it has excessive parameters), so that it fits the training data very well but predicts poorly on the test data. An example is the 1-NN classifier: in many cases it has perfect training performance but can have bad test performance. To avoid overfitting, one generally makes the hypothesis class smaller.

### Underfitting

Underfitting is the opposite of overfitting and is usually hard to detect. It implies that the size of $\mathcal{H}$ is too small. In contrast with how we detect overfitting, it is very rare that the test performance is better than the training performance, i.e. that $R(h_n, D_{\text{train}}) > R(h_n, D_{\text{test}})$. However, one way to detect potential underfitting is that

$$R(h_n, D_{\text{train}}) \approx R(h_n, D_{\text{test}})$$

may imply underfitting. An example is $h_n \equiv 1$ (a constant classifier), whose training performance is nearly the same as its test performance.

## Bias / Variance

Bias and variance have the same meaning for classifiers as for ordinary estimators. Suppose $\hat{x}$ is an estimator of $x \sim P$. The bias is defined as

$$\mathrm{Bias}(\hat{x}) = \mathbb{E}[\hat{x}] - \mu,$$

where $\mu$ is the true value, which is generally unknown. The variance of $\hat{x}$ is

$$\mathrm{Var}(\hat{x}) = \frac{1}{n} \sum_i \big(\hat{x}_i - \mathbb{E}[\hat{x}]\big)^2 \approx \mathbb{E}\big[(\hat{x} - \mathbb{E}[\hat{x}])^2\big].$$

For classifiers (predictors) $h$, the bias is generally written as

$$\mathrm{Bias}(h) = R(\mathbb{E}[h_n], P) - R(f^*, P),$$

where $f^*$ is the Bayes optimal classifier. Alternatively, in some texts it is written as

$$\mathrm{Bias}(h) = R(h, P) - R(f^*, P).$$

In either definition, the bias captures the influence of the choice of $\mathcal{H}$, i.e. the representation error. The two definitions tend to be close as the data set size becomes large.

The variance of the estimator $h_n$ is defined as

$$\mathrm{Var}(h_n) = \mathbb{E}\Big[\big(R(\mathbb{E}[h_n], P) - R(h_n, P)\big)^2\Big] = \mathbb{E}\Big[\big(R(\bar{h}, P) - R(h_n, P)\big)^2\Big],$$

where $\bar{h} = \mathbb{E}[h_n]$. It gives the expectation of how much the classifier deviates from its expected value over different draws of the training data.

**Question:** Is $R(f^*, P)$ …
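Returning to the overfitting/underfitting discussion above, the train-vs-test comparison can be illustrated with a small sketch (not part of the lecture notes). It uses the notes' own examples, a 1-NN classifier and a constant classifier, on an assumed synthetic scikit-learn dataset; the dataset parameters and library calls are illustrative choices, not from the source.

```python
# Illustrative sketch (not from the lecture notes): comparing train vs. test
# accuracy to detect overfitting (1-NN) and underfitting (constant classifier).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

# Assumed setup: a noisy synthetic binary classification problem.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# 1-NN: very flexible hypothesis class, effectively memorizes the training set.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("1-NN   train acc: %.2f  test acc: %.2f"
      % (knn.score(X_train, y_train), knn.score(X_test, y_test)))
# Typically: perfect (1.00) training accuracy, clearly lower test accuracy -> overfitting.

# Constant classifier: tiny hypothesis class, cannot fit the data at all.
const = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("const  train acc: %.2f  test acc: %.2f"
      % (const.score(X_train, y_train), const.score(X_test, y_test)))
# Train and test accuracy are nearly identical (and poor) -> underfitting.
```

The gap between training and test risk signals overfitting, while nearly equal (and poor) training and test risk is the underfitting pattern described above.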

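The estimator-level definitions of bias and variance can also be checked numerically. The sketch below is my own illustration (not from the notes), assuming a Gaussian data model and the sample mean as the estimator $\hat{x}$; it approximates $\mathrm{Bias}(\hat{x}) = \mathbb{E}[\hat{x}] - \mu$ and $\mathrm{Var}(\hat{x}) = \mathbb{E}[(\hat{x} - \mathbb{E}[\hat{x}])^2]$ by averaging over many independent draws of the data.

```python
# Illustrative sketch (not from the notes): empirical bias and variance of the
# sample-mean estimator under an assumed Gaussian model x ~ N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0   # true value mu (usually unknown in practice)
n = 25                 # samples per draw of the training set
repeats = 10_000       # number of independent draws

# Each row is one draw of the data; each entry of x_hat is one realization
# of the estimator (the sample mean of that draw).
x_hat = rng.normal(mu, sigma, size=(repeats, n)).mean(axis=1)

bias = x_hat.mean() - mu                          # Bias(x_hat) = E[x_hat] - mu
variance = ((x_hat - x_hat.mean()) ** 2).mean()   # Var(x_hat) = E[(x_hat - E[x_hat])^2]

print("empirical bias     : %+.4f (should be close to 0)" % bias)
print("empirical variance : %.4f (theory: sigma^2/n = %.4f)"
      % (variance, sigma**2 / n))
```

The sample mean is unbiased, so the empirical bias is near zero, while the variance term captures how much the estimator fluctuates across draws of the data, mirroring the classifier-level definition of $\mathrm{Var}(h_n)$ above.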