# ILLINOIS CS 446 - 090717.2 (6 pages)

Previewing pages*1, 2*of 6 page document

**View the full content.**## 090717.2

Previewing pages
*1, 2*
of
actual document.

**View the full content.**View Full Document

## 090717.2

0 0 51 views

- Pages:
- 6
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 4 Overfitting Naive Bayes Maximum Likelihood Lecturer Sanmi Koyejo Scribe Liqun Lu Sep 7th 2017 Review of generalization and Bayes optimal Generalization The goal of generalization is to find a a function algorithm that has good prediction accuracy performance on new data such that the risk R hn Dtest where Dtest 6 Dtrain satisfies R hn Dtest R hn P Bayes optimal The Bayes optimal classifier is defined as f argmin R f P f F The accuracy error is determined by two parts see Figure 1 a representation error b statistical error optimization error Figure 1 Representation error and statistical error 1 2 4 Overfitting Naive Bayes Maximum Likelihood Overfitting Underfitting Overfitting Overfitting means a function hn has good training performance but bad test performance i e hn does not generalize R hn Dtrain R hn Dtest Generally overfitting implies that the hypothesis class namely H is too big This means the function form one can choose is too flexible e g has excessive parameters such that it can fit very well on the training data but predict poorly on the test data An example is 1 NN classifiers In many cases it has perfect training performance but can have bad test performances To avoid overfitting one generally make the hypothesis class smaller Underfitting Underfitting is the opposite of overfitting and is usually hard to detect It implies size of H is too small Comparing with how we detect overfitting it is very rare that the test performance is better than training performance i e R hn Dtrain R hn Dtest However one way to detect potential underfitting is that if R hn Dtrain R hn Dtest it may imply under fitting An example is hn 1 where the train performance is nearly the same as test performance Bias Variance The bias and variance have the same meaning for classifiers Suppose x is an estimator of x P The bias is defined as Bias x E x where is the true value that is probably unknown The variance of x is 1X V ar x x i E x 2 E x E x 2 n i For classifiers predictors h the bias is generally written as Bias h R E hn P R f P where f is the Bayes optimal classifiers Alternatively in some text it is also written as Bias h R h P R f P 4 Overfitting Naive Bayes Maximum Likelihood 3 In either definition the bias actually captures the influence of choice of H i e the representation error They tend to be close in general as the data set size becomes large The variance of estimator hn is defined as h i h i V ar hn E R E hn P R hn P 2 E R h P R hn P 2 It gives expectation of how much the classifiers would deviate from the expected value from different draws Question Is R f P 0 No Think about the example where x U B B and P y 1 x x B 2B In this case f sign x is still Bayes optimal However the accuracy of f Acc f P 6 1 In general there is always a trade off between bias and variance For example The classifier hn 1 has V ar hn 0 The bias is hard to compute but intuitively it should be high The classifier 1 NN V ar hn is high and the bias will be low We usually seek for a good trade off point between these two w r t a certain metric e g risk utility functions R Figure 2 Figure 2 Trade off between bias and variance In a special case where R is represented by squared loss i e R hn P E y hn x 2 the error from such classifier is nicely characterized as Error hn noise bias 2 var 4 4 Overfitting Naive Bayes Maximum Likelihood where noise is the irreducible error or in other words error of Bayes optimal classifier How do we pick good classifiers Try random algorithm Empirical risk minimization ERM try to approximate hn argminh H R h Dn Probabilistic model approach find P P and find a classifier hn such that hn argminf F R f P This means to find P such that P P is small and examine a very large hypothesis class Naive Bayes Naive Bayes is among one large class of algorithms known as general models The goal of general models is that given Dn find a good conditional probability distribution P such that P x y P x y P y In naive Bayes mode we assume that P x y d Y P xj y j 1 where d is the dimension of x In other words naive Bayes neglects the correlations between different dimensions of x and only assume their dependences on y Figure 3 Contour plot of probability distribution pdf of a the result of naive Bayes model and b the true distribution Figure 3 illustrates an example of naive Bayes model The two dimensions of x x1 and x2 are assumed to be independent random variables both following Gaussian distribution 4 Overfitting Naive Bayes Maximum Likelihood 5 with parameters 1 12 and 2 22 respectively This yields results represented in a Nonetheless the true distribution might well be the case represented in b where strong correlation exists between x1 and x2 The Bayes Theorem gives the following equation which is how we build the naive Bayes model P x y P y P y x P x Question What are the parameters that define the naive Bayes model We need to know the marginal probability parameters n o c j dj 1 where is the set of parameters we need to know c is Bernoulli distribution parameter i e c 1 1 and j is the definitional n o distribution parameter 2 P xj y j e g for Gaussian distribution j j j Logistic Regression Logistic regression is a model based on Bernoulli random variables To understand logistic regression let s first define the sigmoid function see also Figure 4 sigmoid z 1 1 e z Figure 4 Sigmoid function Logistic regression is a known as a discriminative classifier It is defined as P y x Bernoulli sigmoid wT x 1 6 where wT x 4 Overfitting Naive Bayes Maximum Likelihood Pd i 1 wi xi and wi are weight parameters In other words P y 1 x 1 1 e wT x and meanwhile P y 1 x 1 P y 1 x 1 1 1 Tx w 1 e 1 ewT x The probability P y 1 x is also a sigmoid function Maximum likelyhood Maximum likelyhood refers to finding parameters that are most likely such that argmax P Dn where is the parameters we use to construct the estimator P P x y In general directly maximizing the probability is difficult We usually use negative loglikelyhood argmin log P Dn

View Full Document