Contents
1 Recap
  1.1 Generalization
  1.2 Bayes Optimal
2 Overfitting
3 Underfitting
4 Bias and Variance
  4.1 Bias-Variance Tradeoff
  4.2 Special Case
5 Picking Good Learners
6 Classifiers
  6.1 Naive Bayes Classifier
  6.2 Logistic Regression
7 Maximum Likelihood Estimation

CS446: Machine Learning, Fall 2017
Lecture 4: Overfitting, Probabilistic Models, MLE
Lecturer: Sanmi Koyejo        Scribe: Sai Krishna Bollam, Sep. 7th, 2017

1 Recap

1.1 Generalization

Generalization refers to how well a trained model performs on data it has not seen before. Our goal is to find a model with good prediction accuracy on new data. In terms of the risk function, we are trying to minimize

    R(h_n, D_{Test}).

The risk of the classifier on the test data D_{Test} is meant to approximate the risk under the data-generating distribution P:

    R(h_n, D_{Test}) \approx R(h_n, P).

1.2 Bayes Optimal

The Bayes optimal predictor is the function that minimizes the risk:

    f^* = \argmin_{f \in F} R(f, P),

where F is the space of all possible prediction functions.

2 Overfitting

Overfitting occurs when the learned function performs well on the training data but poorly on the test data; in other words, the model does not generalize well. In terms of the risk function,

    R(h_n, D_{Train}) \ll R(h_n, D_{Test}).

Typically this hints that our hypothesis space H is too big. A general fix is to make the hypothesis space smaller.

Example 1. For a 1-NN classifier, training performance can be perfect while test performance is bad.

3 Underfitting

Underfitting is the opposite of overfitting: the model does not fit the data well. It is typically harder to detect. Underfitting can happen when the hypothesis space H is too small.

If the model performs similarly on the training and test sets, it may imply underfitting:

    R(h_n, D_{Train}) \approx R(h_n, D_{Test}).

However, in some rare scenarios the model performs better on the test data than on the training data:

    R(h_n, D_{Test}) < R(h_n, D_{Train}).

A general fix is to make the hypothesis space H larger.

Example 2. The constant classifier h_n = 1, which we looked at in previous classes, predicts the same class for all inputs and performs poorly on both the training and test data.

4 Bias and Variance

Bias measures the representation error; variance measures how far values spread around their average. The bias and variance of the risk R for a predictor h_n are defined as:

    Bias(R; h_n) = E[R(h_n, P) - R(f^*, P)],
    Variance(R; h_n) = E[(E[R(h_n, P)] - R(h_n, P))^2].

The bias and variance of the predictor itself are defined as:

    Bias(h_n) = E[h_n - f^*] = E[h_n] - f^*,
    Variance(h_n) = E[(E[h_n] - h_n)^2] = E[h_n^2] - E[h_n]^2.

Table 1: Bias and variance for two classifiers

    Classifier h_n    Variance(h_n)    Bias(h_n)
    h_n = 1           0                very high
    1-NN              high             lower than h_n = 1

4.1 Bias-Variance Tradeoff

Next, we looked at the bias and variance of the two classifiers discussed earlier, summarized in Table 1. In general, a model with high bias has low variance and vice versa, so we seek a good tradeoff between the two with respect to the risk function R. This tradeoff is shown graphically in Figure 1.

(Figure 1: Bias-Variance Tradeoff)

Sanmi suggested "Structural Risk Minimization" as a literature-review topic for graduate students; it studies the tradeoff between bias and variance.
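As a concrete illustration of Table 1 (not from the lecture), here is a minimal Python sketch, assuming scikit-learn and a synthetic dataset: the 1-NN classifier memorizes its training set and shows the overfitting gap R(h_n, D_{Train}) << R(h_n, D_{Test}), while the constant classifier h_n = 1 scores similarly, and poorly, on both sets.

```python
# Hypothetical demo of overfitting (1-NN) vs. underfitting (h_n = 1).
# Dataset and split sizes are arbitrary choices for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# 1-NN: low bias, high variance. Training risk is (near) zero, but the
# test risk is noticeably higher, the overfitting signature.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("1-NN   train acc:", knn.score(X_tr, y_tr), "test acc:", knn.score(X_te, y_te))

# Constant classifier h_n = 1: high bias, zero variance. Train and test
# risk are similar and both poor, the underfitting signature.
const = DummyClassifier(strategy="constant", constant=1).fit(X_tr, y_tr)
print("h_n=1  train acc:", const.score(X_tr, y_tr), "test acc:", const.score(X_te, y_te))
```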
4.2 Special Case

Next, we looked at the special case where the risk function is the squared loss:

    R(h_n, P) = E[(y - h_n(x))^2].

Here, the error of the model h_n decomposes as

    Error(h_n) = noise + Variance + Bias^2,

where the noise is the irreducible error, i.e., the error of the Bayes optimal predictor. (A numerical check of this decomposition is sketched at the end of these notes.)

5 Picking Good Learners

We then looked at some approaches to picking a good learner based on what we have learned so far.

0. Try a random algorithm: an easy and obvious approach is to come up with an algorithm by looking at the data. This can give good performance, but we would prefer a more principled approach.

1. Empirical Risk Minimization (ERM): take a hypothesis space H and search it for the learner that minimizes the empirical risk; most of the effort goes into choosing the hypothesis space. Mathematically,

    h_n = \argmin_{h \in H} R(h, D_n).

2. Probabilistic approach: find a population estimate \hat{P} that approximates the data-generating distribution P, then search the space of all prediction functions for the learner that minimizes the risk under \hat{P}. Here the restrictions are built into the approximation of the data-generating distribution, and a relatively simple procedure finds the learner:

    h_n = \argmin_{f \in F} R(f, \hat{P}),

where \hat{P} \approx P is the population approximation estimated from the input data D_n.

6 Classifiers

6.1 Naive Bayes Classifier

The Naive Bayes classifier belongs to a class of algorithms known as generative models. Generative algorithms model the joint probability distribution P(X, y). The goal is to find a good model \hat{P} for the data D_n such that

    P(X, y) \approx \hat{P}(X | y) \hat{P}(y).

In the Naive Bayes model, we assume the features of X are conditionally independent given y, so the conditional probability P(X | y) factors into a product of per-feature conditionals:

    \hat{P}(X | y) = \prod_{i=1}^{d} P(x_i | y),

where X = (x_1, x_2, ..., x_d).

Example 3. Let X = (x_1, x_2). In this case,

    \hat{P}(X | y = c) \approx P(x_1 | y = c) P(x_2 | y = c).

The resulting probability density function depends on the distribution of the features x_i. If each P(x_i | y = c) is Gaussian for class c, the resulting distribution is a product of two Gaussians. Continuing the example, assume

    P(x_1 | y = c) \sim N(\mu_1, \sigma_1^2),    P(x_2 | y = c) \sim N(\mu_2, \sigma_2^2).

The product can be represented as shown in Figure 2: it is centered at (\mu_1, \mu_2), with variances \sigma_1^2 and \sigma_2^2 in the directions of x_1 and x_2 respectively.

(Figure 2: \hat{P}(X | y = c))

After we model the conditional probability, we predict the output class using Bayes' rule:

    P(y | X) = P(X | y) P(y) / P(X).

To compute this we also need P(y), which can be represented with a Bernoulli distribution:

    P(y = c | \pi_c) = Bern(\pi_c),

where c denotes the class. So the set of parameters we need is

    \Theta = { {\pi_c}, {\theta_i}_{i=1}^{d} },

where \theta_i parameterizes the distribution of x_i; in the Gaussian case, \theta_i = {\mu_i, \sigma_i^2}. (A toy implementation of this prediction rule is sketched at the end of these notes.)

6.2 Logistic Regression

Sigmoid function. The sigmoid is the function \sigma(z) = 1 / (1 + e^{-z}), which maps any real number to the interval (0, 1).
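As a minimal sketch of this definition (Python with NumPy; an illustration, not the notes' continuation):

```python
# Sketch of the sigmoid (logistic) function used by logistic regression.
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                    # 0.5, the decision boundary
print(sigmoid(np.array([-5.0, 5.0])))  # saturates toward 0 and 1
```

Because its output lies in (0, 1), the sigmoid output can be read as a class probability, which is what logistic regression does with a linear score z = w^T x.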


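Returning to the Naive Bayes classifier of Section 6.1, the prediction rule argmax_c P(y = c) \prod_i P(x_i | y = c) can be sketched in a few lines. This is a hypothetical toy implementation under the Gaussian assumption of Example 3, not the lecture's code; it assumes the parameters \Theta = ({\pi_c}, {\mu_i, \sigma_i^2}) have already been estimated, and the function name `predict` and the toy numbers are illustrative choices.

```python
# Toy Gaussian Naive Bayes prediction rule (parameters assumed estimated).
import numpy as np
from scipy.stats import norm

def predict(x, priors, means, stds):
    """x: (d,) input. priors[c] = pi_c; means[c], stds[c]: (d,) per-class
    parameters. Returns argmax_c P(y=c) * prod_i P(x_i | y=c)."""
    scores = {}
    for c in priors:
        # log pi_c + sum_i log N(x_i; mu_i, sigma_i^2), logs for stability
        scores[c] = np.log(priors[c]) + norm.logpdf(x, means[c], stds[c]).sum()
    return max(scores, key=scores.get)

# Toy usage with d = 2 features, matching Example 3.
priors = {0: 0.5, 1: 0.5}
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}
stds = {0: np.array([1.0, 1.0]), 1: np.array([1.0, 1.0])}
print(predict(np.array([1.8, 2.2]), priors, means, stds))  # -> 1
```

Working in log space avoids underflow when the product over d features is small; the normalizer P(X) is dropped since it does not affect the argmax.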

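Finally, returning to Section 4.2, the decomposition Error(h_n) = noise + Variance + Bias^2 can be checked numerically for the squared loss. The setup below is an assumption made for illustration, not from the lecture: y ~ N(\mu, \sigma^2) and a deliberately biased shrinkage learner h_n = 0.5 * mean(y_train), chosen so the bias term is nonzero.

```python
# Monte Carlo check of Error = noise + Variance + Bias^2 (squared loss).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 1.0, 20, 100_000

errors, preds = [], []
for _ in range(trials):
    y_train = rng.normal(mu, sigma, size=n)
    h = 0.5 * y_train.mean()         # deliberately biased learner
    y_test = rng.normal(mu, sigma)   # one fresh draw from P
    errors.append((y_test - h) ** 2)
    preds.append(h)

preds = np.array(preds)
noise = sigma ** 2                   # irreducible (Bayes optimal) error
variance = preds.var()               # E[(E[h_n] - h_n)^2]
bias_sq = (preds.mean() - mu) ** 2   # (E[h_n] - f*)^2
print("measured error:           ", np.mean(errors))
print("noise + variance + bias^2:", noise + variance + bias_sq)
```

Up to Monte Carlo error, the two printed numbers agree (about 2.01 for these settings: noise 1, bias^2 1, variance 0.25 * sigma^2 / n).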
