# ILLINOIS CS 446 - 090717.1 (7 pages)

Previewing pages*1, 2*of 7 page document

**View the full content.**## 090717.1

Previewing pages
*1, 2*
of
actual document.

**View the full content.**View Full Document

## 090717.1

0 0 47 views

- Pages:
- 7
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 4 Overfitting Probabilistic Models MLE Lecturer Sanmi Koyejo 1 1 1 Scribe Sai Krishna Bollam Sep 7th 2017 Recap Generalization Generalization refers to how well the trained model performs on data not seen before Our goal is to find a model that has good prediction accuracy on new data In terms of risk function we are trying to minimize R hn DT est The risk of classifier on test data DT est is meant to approximate the risk on data generating model P R hn DT est R hn P 1 2 Bayes Optimal Bayes Optimal is the function that minimizes the risk and is denoted as f argmin R hn P f F where F is the space of all possible prediction functions 2 Overfitting Overfitting occurs when the learned function performs well on the training data but performs poorly on test data In other terms the model doesn t generalize well We can represent this in terms of risk function as follows R hn DT rain R hn DT est 1 2 4 Overfitting Probabilistic Models MLE Typically this hints that our hypothesis space H is too big A general fix is to try and make the hypothesis space smaller Example 1 1 In case of 1 NN classifier the training performance could be perfect but the classifier has a bad test performance 3 Underfitting Undefitting is the opposite of overfitting where the model doesn t fit the data well It is typically harder to detect Underfitting can happen when the hypothesis space H is too small If the performance of the model is similar on training and test sets it may imply underfitting R hn DT rain R hn DT est However in some rare scenarios the model could perform well on test data compared to the training data R hn DT est R hn DT rain A general fix is to make the hypothesis space H larger Example 2 The classifier hn 1 which we ve looked at in previous classes It predicts the same class for all inputs and performs poorly on both the training and test data 4 Bias and Variance Bias measures the representation error and variance denotes how far the values are spread out from the average value The bias and variance of the risk function R for a predictor hn are defined as Bias R hn E R hn P R f P V ariance R hn E E R hn P R hn P 2 The bias and variance for the predictor are defined as following Bias hn E hn f E hn f V ariance hn E E hn hn 2 E hn 2 E hn 2 4 Overfitting Probabilistic Models MLE 3 Table 1 Bias and Variance for two classifiers Classifier hn Variance hn Bias hn 1 0 very high 1 NN high lower than hn 1 4 1 Bias Variance Tradeoff Next we looked at the bias and variance of two classifiers discussed earlier as shown in Table 1 In general if the bias for a model is high it will have low variance and vice versa So we seek a good tradeoff between the two with respect two the risk function R This can be graphically shown as in Figure 1 Figure 1 Bias Variance Tradeoff Sanmi suggested Structural Risk Minimization as a topic for literature review for grad students It studies the tradeoff between bias and variance 4 2 Special Case Next we looked at a special case where the risk function is a squared loss function This can be denoted as R hn P E y hn x 2 Here the error in the model hn decomposes as shown below Error hn noise V ariance Bias2 where noise is an irreducible error or the error of Bayes Optimal 4 4 Overfitting Probabilistic Models MLE 5 Picking good learners We then looked at some of the approaches to picking good learner based on what we ve learned 0 Try a random algorithm An easy and obvious approach would be to come up with an algorithm by looking at the data This can give good performance However we could try some principled approach 1 Empirical Risk Minimization ERM In this approach we take a hypothesis space H and search for the learner that minimizes the risk Most of the effort goes into finding the hypothesis space Mathematically we can represent this as follows hn argmin R h Dn h H 2 Probabilistic Approach We find a population Pb that approximates the data generating distribution P and then search for a learner that minimizes the risk from the space of all prediction functions In this case the restrictions are built into finding the approximation of data generating distribution and we use a relatively simple approach to find the learner hn argmin R f Pb f F where Pb P is the population approximation estimated from input data Dn 6 6 1 Classifiers Naive Bayes Classifier Naive Bayes classifier is from a class of algorithms known as Generative Models Generative algorithms model the joint probability distribution P X y The goal here is to find a good model Pb for data Dn such that P X y Pb X y Pb y In Naive Bayes model we assume that the features in X are independent So the conditional probability P X y can be written as the product of individual conditional probabilities Mathematically it can be represented as follows Pb X y d Y i 1 where X x1 x2 xd P xi y 4 Overfitting Probabilistic Models MLE 5 Example 3 Let us consider X x1 x2 In this case Pb X y c P x1 y c P x2 y c The resulting probability density function depends on the distribution of features xi If each P xi y c has a Gaussian distribution for a class c The resulting probability distribution is product of two Gaussians From the above example let s assume P x1 y c N 1 12 P x2 y c N 2 22 The resulting distribution has a distribution which can be represented as shown in Figure 2 with center at 1 2 and variances of 12 and 22 in the direction of x1 and x2 respectively Figure 2 Pb X y c After we find the conditional probability we predict the output class using the following formula P y X P X y P y P X To calculate this value we also need to know P y This can be represented as a Bernoulli distribution P y c c Bern c where c denotes the class So the set of parameters we need are n o c i di 1 where i is the parameter defining the distribution of xi In case of Gaussian i i i2 6 6 2 4 Overfitting Probabilistic Models MLE Logistic Regression Sigmoid Function Sigmoid is defined as follows z sigmoid z 1 1 e z The graph of sigmoid function is shown in Figure 3 Figure 3 Sigmoid function Logistic Regression is a classifier from the class of models called Discriminative models It models the conditional probability P …

View Full Document