# ILLINOIS CS 446 - 090717.3 (6 pages)


## 090717.3


- Pages: 6
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


**CS446: Machine Learning, Fall 2017**

**Lecture 1: Overfitting, Naive Bayes, Logistic Regression, MLE**

Lecturer: Sanmi Koyejo. Scribe: Zhenbang Wang, Sept. 7th, 2017.

### Review

#### Generalization

Generalization refers to how accurately a model can predict the result from unseen data. For a given data-generating distribution $P$, a good model should satisfy

$$R(h_n, D_{test}) \approx R(h_n, P).$$

#### Overfitting

Overfitting means that a model $h_n$ performs well on training data but badly on unseen data. In this case $h_n$ does not generalize. In terms of risk:

$$R(h_n, D_{train}) \ll R(h_n, D_{test}).$$

#### Underfitting

Underfitting is the opposite of overfitting: the model does not fit our data well enough. Typically, a small hypothesis space $H$ leads to underfitting, and underfitting is hard to detect. However, similar performance on training data and test data can be a clue for underfitting. In other words,

$$R(h_n, D_{train}) \approx R(h_n, D_{test}).$$

In rare underfitting cases, models perform better on test data than on training data:

$$R(h_n, D_{test}) < R(h_n, D_{train}).$$

Generally, underfitting can be fixed by enlarging $H$.

### Bayes Optimal

The Bayes optimal classifier is the classifier that minimizes the risk:

$$f^* = \arg\min_{f \in F} R(f, P),$$

where $F$ is the space of all possible classifiers.

### Bias and Variance

Bias and variance are two measurements that describe errors in learning algorithms.

#### Bias

Bias comes from representation error. The bias of an estimator is the difference between its expected value and the true value. Suppose $x$ is an estimator of some quantity of the data distribution $P$. Then

$$\mathrm{Bias}(x) = E[x] - \theta,$$

where $\theta$ is the true value. For classifiers, bias is defined as

$$\mathrm{Bias}(h_n) = R(E[h_n], P) - R(f^*, P),$$

or, writing $\bar{h}$ for $E[h_n]$,

$$\mathrm{Bias}(h_n) = R(\bar{h}, P) - R(f^*, P),$$

where $f^*$ represents the optimal classifier.

#### Variance

Variance captures how much the estimate changes under small fluctuations in the training set. Suppose again that $x$ is an estimator of some quantity of the data distribution $P$. Then

$$\mathrm{Var}(x) = E\left[(x - E[x])^2\right].$$

For classifiers, variance is defined as

$$\mathrm{Var}(h_n) = E\left[\left(R(E[h_n], P) - R(h_n, P)\right)^2\right],$$

or

$$\mathrm{Var}(h_n) = E\left[\left(R(\bar{h}, P) - R(h_n, P)\right)^2\right].$$

#### Bias-Variance Tradeoff

The bias-variance tradeoff is a common problem: we want to minimize bias and variance simultaneously, so we look for a good tradeoff point for a given risk function $R$. For an example, see Figure 1.

Figure 1: Bias-Variance Tradeoff

**Special case.** When the risk $R$ is the squared loss, the total error decomposes nicely. Formally,

$$R(h_n, P) = E\left[(y - h_n(x))^2\right],$$

$$\mathrm{Error}(h_n) = \text{noise} + \mathrm{Bias}^2 + \mathrm{Var},$$

where the noise is the irreducible error, that is, the error of the Bayes optimal classifier.

### Picking good classifiers

- Try a random algorithm.
- Empirical risk minimization (ERM): $h_n = \arg\min_{h \in H} R(h, D_n)$.
- Probabilistic approach: find a good approximate data distribution $\hat{P}$ such that $\hat{P} \approx P$, and then obtain the model by minimizing the risk under it: $h_n = \arg\min_{f \in F} R(f, \hat{P})$.

### Naive Bayes

Naive Bayes classifiers are a family of algorithms known as generative models. Generative models approximate the joint probability distribution $P(x, y)$. The goal is to find a good approximate distribution $\hat{P}$ such that

$$\hat{P}(x, y) = \hat{P}(x \mid y)\, \hat{P}(y).$$

In Naive Bayes, we assume that the features are independent given the class. Therefore, we can rewrite the approximate distribution as

$$\hat{P}(x, y) = \hat{P}(y) \prod_{i=1}^{n} \hat{P}(x_i \mid y),$$

where $x$ has $n$ elements.

**Example 1.** Let $x = [x_1, x_2]$. Then

$$P(x \mid y = c) = P(x_1 \mid y = c)\, P(x_2 \mid y = c),$$

where $c$ is a particular class of $y$. For Gaussian Naive Bayes, we assume that each feature follows a Gaussian distribution given the class, i.e.

$$P(x_1 \mid y = c) = N(\mu_1, \sigma_1^2), \qquad P(x_2 \mid y = c) = N(\mu_2, \sigma_2^2).$$

Finally, we use the training data to estimate the parameter pairs $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$ (Figure 2).

Figure 2: Gaussian Naive Bayes

#### Bayes theorem

From Bayes' theorem we have

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}.$$

Since $P(x)$ is the same for every class $y$, we can ignore it when comparing $P(y \mid x)$ across classes.

#### Parameters in Naive Bayes

To obtain the conditional probability, we need to estimate the parameters $\{\theta_c, \{\theta_i\}_{i=1}^{d}\}$, where $\theta_c$ parameterizes the Bernoulli distribution of the class and $\{\theta_i\}_{i=1}^{d}$ are the parameters that define the distribution of $x$.

### Logistic Regression

To understand logistic regression, we first define the sigmoid function:

$$\sigma(z) = \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}.$$

See the plot of the sigmoid function in Figure 3.

Figure 3: Sigmoid function

Logistic regression is a discriminative model. It is defined as

$$P(y \mid x) = \mathrm{Bernoulli}\left(\sigma(w^T x)\right), \qquad w^T x = \sum_{i=1}^{d} w_i x_i.$$

Therefore, we can make predictions based on

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}},$$

$$P(y = -1 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{w^T x}}.$$

### Maximum Likelihood Estimator

Maximum likelihood tries to find the parameters under which the observed data are most likely. In other words,

$$\hat{\theta} = \arg\max_{\theta} P_\theta(D_n),$$

where $\theta$ is the parameter used to make predictions, i.e. $P_\theta$ is our model of $P(x, y)$. In practice, maximizing this probability directly is difficult, and the product of many small probabilities also causes numerical underflow, so we minimize the negative log-likelihood instead:

$$\hat{\theta} = \arg\min_{\theta} -\log P_\theta(D_n).$$
