CS446: Machine Learning, Fall 2017
Lecture 5: Probabilistic Models, Maximum Likelihood, Gradient Descent, Regression
Lecturer: Sanmi Koyejo    Scribe: Shruti Bhargava, Sep 12th, 2017

Recap

Overfitting: We have seen that overfitting refers to the condition where our model, or the function class $H$ we are working with, is too flexible (too big). This means that we have incorporated extremely minor variations of the input, catering to noise along with the true signal. The detection technique: training performance becomes high and exceeds the test performance.

Underfitting: This implies that our $H$ is too small, or over-constrained. In other words, it lacks the capacity to incorporate the variation present in the true signal.

Bias-Variance tradeoff: This helps us understand the error tradeoffs in a model. Bias tells us something about the flexibility or representation ability; it conveys how far one is from the best possible, i.e.,
$$\text{Bias} = E[h_n] - f^*,$$
where $h_n$ is our hypothesis function and $f^*$ is the Bayes optimal function. Variance, on the other hand, captures the noise or optimization error. We have seen that
$$\text{Variance} = E\big[(E[h_n] - h_n)^2\big].$$
Most risk functions $R$ can be expressed as some function of the bias and the variance of the model, i.e., for a given $R$ we can find a function $\phi$ such that
$$E[R(h_n, D_n)] = \phi(\text{Bias}, \text{Variance}),$$
where $D_n$ is the dataset used for evaluation. One example of such a function arises if we look at the risk in terms of the squared loss (squared error), defined as $(y - h_n)^2$. It can be shown (proof in Section 6.4.4 of the textbook [1]) that
$$E\big[(y - h_n)^2\big] = \text{noise} + \text{Bias}^2 + \text{Variance}.$$
This tells us the knobs one can pull in order to minimize the error. Several other examples are given in the textbook. The fact that bias and variance are tied together can be understood by observing that increasing one results in a decrease in the other, thereby affecting the total error; hence the trade-off matters.

Selecting and Training Models

We initially started by trying to select and train machine learning models. The different tracks to do so, discussed previously, are as follows:

0. Come up with an algorithm. This seems vague, but in practice a lot of effective machine learning is done through engineering and playing around with algorithms. What matters is that your algorithm gives good predictions on new data.

1. Empirical Risk Minimization (ERM). A more principled approach to finding models, wherein from a class of functions one picks the model that gives the least error on the training data, i.e.,
$$h_n = \underset{h \in H}{\operatorname{argmin}} \; R(h, D_n),$$
where $D_n$ is the training data and $H$ is the search space, a subset of $F$.

2. Probabilistic Modelling. This framework also provides a structured approach to obtaining a good model. It broadly involves the following two steps:
(a) Find $\hat{p}$ to approximately mimic the data generating distribution $p$.
(b) Pick the model from the entire function space that gives the least error on this estimated distribution, i.e.,
$$h_n = \underset{h \in F}{\operatorname{argmin}} \; R(h, \hat{p}). \tag{1}$$

In this lecture we understand how probabilistic modelling works.

Probabilistic Modelling

This framework has the initial goal of modelling the data generating distribution. Let us understand this process through an example.

Example: Suppose we are given a 1-D input space $\{0, 1\}$ and a dataset $D_n = \{x_1, \ldots, x_n\}$. For a binary dataset, the Bernoulli model (the coin flip model) works. For instance, we observe the dataset $D_n = \{0, 0, 1, 0, 1, 1\}$ with independent and identically distributed (iid) samples. We need to find a model distribution that explains this data. We try to model $D_n$ as a Bernoulli distribution with parameter $\theta$, i.e., $D_n \sim B(\theta)$. We know from our knowledge of Bernoulli distributions that a sample
$$X = \begin{cases} 1 & \text{w.p. } \theta \\ 0 & \text{w.p. } 1 - \theta. \end{cases}$$
In other words,
$$P(X \mid \theta) = \theta^{X} (1 - \theta)^{1 - X}. \tag{2}$$
By plugging in the values we can see that $P(X = 0 \mid \theta) = 1 - \theta$ and $P(X = 1 \mid \theta) = \theta$. Now we'll find the $\theta$ that best mimics the data generating distribution. Some possible techniques for doing this are:

- Maximum Likelihood (ML)
- Maximum a Posteriori (MAP)
- Bayesian modelling

In this lecture we'll learn more about Maximum Likelihood; MAP will be discussed in the HW, and Bayesian models come into play later, when graphical models are discussed.

Maximum Likelihood (ML)

Under the probabilistic modelling framework for finding the right model, we started with approximating the data generating distribution. In the example above the data is 1-D and binary valued (i.e., in $\{0, 1\}$), so we opted for a Bernoulli distribution with unknown parameter $\theta$ to approximately represent our data generating distribution. Now we try to find the specific parameter value (which parameters are needed depends on the distribution; for Bernoulli it is $\theta$). According to ML, choose the $\theta$ that maximizes the probability of the data. Mathematically,
$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; P(D_n \mid \theta). \tag{3}$$
We discussed earlier that it is often easier to work with the log probability. The log probability of the data given $\theta$ is referred to as $L(\theta)$. Mathematically,
$$L(\theta) = \log P(D_n \mid \theta). \tag{4}$$
From (3) and (4) we get
$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; L(\theta). \tag{5}$$
Since the samples are iid, we can find the probability of our data as
$$P(D_n \mid \theta) = \prod_{i=1}^{n} P(X_i \mid \theta). \tag{6}$$
Plugging this into (4),
$$L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta).$$
From (2) we know how likely a particular $X$ is, given a parameter value $\theta$. So
$$L(\theta) = \sum_{i=1}^{n} \big[ x_i \log \theta + (1 - x_i) \log(1 - \theta) \big].$$
To compute (5), since $L$ is a continuous (differentiable) function, we use the fact that the gradient is 0 at critical points (maxima/minima). Therefore,
$$\frac{dL}{d\theta} = \frac{\sum_{i=1}^{n} x_i}{\theta} - \frac{\sum_{i=1}^{n} (1 - x_i)}{1 - \theta} = 0 \quad \Longrightarrow \quad \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
This can be interpreted as: we want the estimator to predict 1 with the same probability as the fraction of 1s seen in the dataset.
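To make the closed-form result concrete, the sketch below (not part of the original notes; the variable and function names are illustrative) computes the Bernoulli maximum-likelihood estimate for the example dataset $D_n = \{0, 0, 1, 0, 1, 1\}$ and checks it against a simple grid search over the log-likelihood $L(\theta)$.

import numpy as np

# Example dataset from the lecture: D_n = {0, 0, 1, 0, 1, 1}
x = np.array([0, 0, 1, 0, 1, 1])

# Closed-form ML estimate: theta_hat = fraction of 1s in the data
theta_hat = x.mean()

def log_likelihood(theta, x):
    # L(theta) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Sanity check: a grid search over theta peaks at the same value
grid = np.linspace(0.01, 0.99, 99)
theta_grid = grid[np.argmax([log_likelihood(t, x) for t in grid])]

print(theta_hat)   # 0.5, the fraction of 1s in D_n
print(theta_grid)  # ~0.5 as well (up to the grid resolution)

The grid search is only a check; for the Bernoulli model the analytic solution $\hat{\theta} = \frac{1}{n}\sum_i x_i$ is exact.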
Generally, one of the following two major strategies can be used to compute $\hat{\theta} = \operatorname{argmax}_\theta L(\theta)$ (equivalently, $\operatorname{argmin}_\theta -L(\theta)$):

- Analytically solve $\nabla L(\theta) = 0$ to obtain the critical point as the required $\hat{\theta}$, as discussed above. This technique becomes harder and very tedious in some cases.
- Gradient descent. This strategy is useful when direct computation of the critical point becomes harder. It is an optimisation technique (it finds the minimum of a function), so we now consider our problem as minimising the negative of the log likelihood, $-L(\theta)$. It is a sequential algorithm and works in the literal sense of its name: descend along the gradient. After some steps you would hopefully end up at a local minimum of the function.

[Figure 1: Gradient descent]

Mathematically, at iteration $t$,
$$\theta_{t+1} = \theta_t - \eta_t \, \nabla_\theta \big({-L(\theta_t)}\big),$$
where $\eta_t$, called the step size, defines how much you want to move in that direction. Observe that when $\nabla_\theta L(\theta_t)$ gets close to 0, the subsequent steps don't have much effect on $\theta_t$. When the gradient becomes 0, any further iterations would keep us at the same $\theta$, thereby ensuring we stay at the solution once we reach it.

Cons of gradient descent: it only guarantees local optimality; this ...
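To illustrate the update rule above, here is a minimal sketch (not from the lecture; the starting point, step size, and iteration count are arbitrary illustrative choices) that runs gradient descent on the negative Bernoulli log-likelihood for the same dataset and recovers the closed-form estimate.

import numpy as np

x = np.array([0, 0, 1, 0, 1, 1])   # dataset from the example above

def neg_log_likelihood_grad(theta, x):
    # d(-L)/d(theta) = -sum(x_i)/theta + sum(1 - x_i)/(1 - theta)
    return -np.sum(x) / theta + np.sum(1 - x) / (1 - theta)

theta = 0.2         # arbitrary initial guess in (0, 1)
step_size = 0.01    # eta_t, held constant here for simplicity
for t in range(200):
    theta = theta - step_size * neg_log_likelihood_grad(theta, x)
    theta = float(np.clip(theta, 1e-6, 1 - 1e-6))   # keep theta inside (0, 1)

print(theta)   # approaches 0.5, matching theta_hat = (1/n) * sum(x_i)

Because $-L(\theta)$ is convex in $\theta$ for the Bernoulli model, the local minimum found here is also the global minimum, so the local-optimality caveat is not an issue in this particular example.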

