# ILLINOIS CS 446 - 091217.2 (8 pages)


## 091217.2


- Pages: 8
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017
Lecture 5: Optimization, Convexity, Gradient Descent
Lecturer: Sanmi Koyejo. Scribe: Ally Kaminsky. September 12, 2017.

## Recap of Previous Lecture: Overfitting and Underfitting

**Overfitting.** A model $h_n$ overfits when it is too complex, meaning the hypothesis class $\mathcal{H}$ is too big or too flexible (i.e., has too many parameters). Overfitting causes a model to perform well on training data but poorly on testing data.

**Underfitting.** A model underfits when its hypothesis class $\mathcal{H}$ is too small, meaning the functional form has too few parameters. This case is more difficult to identify than overfitting, because a model's performance on its training data is almost always better than its performance on testing data.

**Bias-Variance Tradeoff.** The bias of an estimator is
$$\mathrm{Bias}(h_n) = \mathbb{E}[h_n] - \theta,$$
where $h_n$ is the estimator on some observed data and $\theta$ is an unknown fixed constant. The variance of an estimator $h_n$ is
$$\mathrm{Variance}(h_n) = \mathbb{E}\big[(\mathbb{E}[h_n] - h_n)^2\big].$$
Bias and variance help us understand the error tradeoffs in a model, and this applies to all supervised learning tasks. The expected risk of a model can be decomposed as
$$\mathbb{E}[R(h_n; D_n)] = \mathrm{Bias}^2 + \mathrm{Variance},$$
where $h_n$ is an estimator on some observed dataset $D_n$.

## Selecting and Training Models

There are several ways to select and train models:

- **Engineer an algorithm** that works well. This is not a reasonable systematic approach, but it sometimes works well in practice.
- **Empirical Risk Minimization (ERM).** Find a model $h_n$ such that
  $$h_n = \operatorname*{argmin}_{h \in \mathcal{H}} R(h; D_n),$$
  where $\mathcal{H}$ is the hypothesis class of the model and $R(h; D_n)$ is the risk of a model $h$ given dataset $D_n$. The model produced by ERM is closely tied to the chosen risk function.
- **Probabilistic model approach.** Find a good approximation $\hat{P} \approx P$, then find a model $h_n$ such that
  $$h_n = \operatorname*{argmin}_{h \in \mathcal{F}} R(h; \hat{P}),$$
  where $\mathcal{F}$ is the class of all possible prediction functions and $R(h; \hat{P})$ is the risk of a model $h$ given $\hat{P}$. Benefits of the
probabilistic model approach: you do not need to start from scratch for each risk function, because $\hat{P}$ has already been found, so probabilistic models can be reused; and probabilistic models can incorporate domain knowledge.

## Fitting a Model

**Motivating example.** The goal of the following is to find a model that explains some data well. Suppose $\mathcal{X} = \{0, 1\}$ and the dataset is $D_n = \{x_1, x_2, \ldots, x_n\}$. Assume $D_n$ is sampled independent and identically distributed (iid) from the Bernoulli distribution $B(\theta)$ for some parameter $\theta$. This means that we sample $X \sim P(x \mid \theta)$ as $1$ with probability $\theta$ and $0$ with probability $1 - \theta$. That is,
$$P(X = x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}.$$
The conditional probabilities that $x = 0$ and $x = 1$ given $\theta$ are
$$P(x = 0 \mid \theta) = 1 - \theta, \qquad P(x = 1 \mid \theta) = \theta.$$

## Methods for Fitting a Model

To fit a model to the data from the example above, we need to find the parameter $\theta$. This can be done in several ways, including:

- Maximum Likelihood Estimation (MLE)
- Maximum a posteriori (MAP)
- Bayesian modeling

**Maximum Likelihood Estimation.** MLE fits a model to observed data by finding the parameter values that maximize the likelihood of seeing the observed data given those parameters. The likelihood of an observed dataset given parameters $\theta$ is $P(D_n \mid \theta)$; to maximize it, we want to find
$$\hat{\theta} = \operatorname*{argmax}_{\theta} P(D_n \mid \theta).$$
The log-likelihood function $L(\theta) = \log P(D_n \mid \theta)$ is often used instead, because its maximizer coincides with the likelihood's and it makes calculations easier. Since maximizing $L$ is equivalent to minimizing $-L$, the end goal of maximum likelihood estimation can be stated as finding $\operatorname*{argmin}_{\theta} \, {-L(\theta)}$.

**Bernoulli case example.** Continuing the previous example, the probability of a single data observation $x_i$ given parameter $\theta$ is
$$P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{1 - x_i}.$$
This can be thought of as: for a fixed $\theta$, how likely is it to observe a $1$ or a $0$? We want the probability of a whole dataset of observations rather than a single data observation.
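The single-observation probability is easy to state in code. A minimal sketch in Python (the function name `bernoulli_pmf` is my own choice, not from the lecture):

```python
def bernoulli_pmf(x, theta):
    """P(X = x | theta) = theta^x * (1 - theta)^(1 - x), for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

# For a fixed theta, the two cases recover the conditional probabilities:
# bernoulli_pmf(1, theta) == theta and bernoulli_pmf(0, theta) == 1 - theta.
p1 = bernoulli_pmf(1, 0.7)   # probability of observing 1 when theta = 0.7
p0 = bernoulli_pmf(0, 0.7)   # probability of observing 0 when theta = 0.7
```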
Because the data are sampled iid, the dataset probability factorizes:
$$P(D_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta).$$
The log-likelihood function for the iid case is therefore
$$L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta).$$
Substituting the Bernoulli $P(x_i \mid \theta)$ into the log-likelihood gives the Bernoulli log-likelihood
$$L(\theta) = \sum_{i=1}^{n} \big[ x_i \log \theta + (1 - x_i) \log(1 - \theta) \big],$$
where $n$ is the number of observations in the dataset, $x_i$ is the $i$th observation, and $\theta$ is the parameter.

## Optimization Methods

To find the best parameter $\theta$ to fit a model to some observed data, we need to find $\operatorname*{argmin}_{\theta} \, {-L(\theta)}$. This is an optimization problem, so we can use various optimization methods to find $\hat{\theta}$, including:

**Solving directly.** Solve $\nabla L(\theta) = 0$. This method uses calculus to find the critical points of the log-likelihood function, which are candidates for the optimal $\theta$.

**Gradient descent.** Gradient descent can be used to find locally optimal solutions. The gradient of a function is its rate of change; the gradient of a function $f$ is written $\nabla f$, and at a function's critical points the gradient is $0$. Gradient descent is a sequential algorithm for finding local optima: it takes steps along the negative gradient to update the parameter being optimized, until it converges to an optimal or near-optimal solution. The parameter update rule for finding $\min_{\theta} f(\theta)$ is
$$\theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t),$$
where $t$ is the number of iterations the algorithm has completed, $\eta_t$ is the stepsize (how far the algorithm moves along the negative gradient at iteration $t$), and $\theta$ is the parameter being optimized. To use gradient descent to find the optimal $\theta$ for maximum likelihood estimation, apply the same rule to $-L$:
$$\theta_{t+1} = \theta_t - \eta_t \nabla\big({-L(\theta_t)}\big) = \theta_t + \eta_t \nabla L(\theta_t),$$
where $L(\theta)$ is the log-likelihood function for the model.
We must find the gradient of the log-likelihood function with respect to the parameter $\theta$ and use it in the update step of gradient descent; this works because $\nabla(-f) = -\nabla f$, so maximizing $L$ and minimizing $-L$ use the same gradient information.

Some points to consider regarding gradient descent:

- Gradient descent finds *locally* optimal solutions, so the starting point of the algorithm matters: different starting points may yield different solutions.
- The function $f$ that gradient descent is performed on should be continuous and differentiable.
- Gradient descent can overshoot the optimum if the stepsize $\eta_t$ is too large.
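The two optimization methods above can be sketched end to end on the Bernoulli example: gradient descent applied to the negative log-likelihood converges to the same $\hat{\theta}$ that solving $\nabla L(\theta) = 0$ gives in closed form, namely the empirical frequency of ones. This is a minimal illustration, not the lecture's implementation; the function names, toy dataset, stepsize, and iteration count are my own choices:

```python
def gradient_descent(grad, theta0, stepsize=0.01, num_iters=2000):
    """Iterate theta_{t+1} = theta_t - stepsize * grad(theta_t)."""
    theta = theta0
    for _ in range(num_iters):
        theta = theta - stepsize * grad(theta)
    return theta

def neg_log_likelihood_grad(theta, data):
    """Gradient of -L(theta) = -sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]."""
    n_ones = sum(data)
    n_zeros = len(data) - n_ones
    return -n_ones / theta + n_zeros / (1 - theta)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # toy dataset: 7 ones in 10 draws

# Gradient descent on the negative log-likelihood, started from theta = 0.5.
theta_gd = gradient_descent(lambda t: neg_log_likelihood_grad(t, data), theta0=0.5)

# Solving grad L(theta) = 0 directly yields the empirical frequency of ones.
theta_closed_form = sum(data) / len(data)
```

A constant stepsize is used here for simplicity; the $\eta_t$ notation in the update rule allows the stepsize to vary across iterations. Note also that the iterate must stay inside $(0, 1)$ for the Bernoulli log-likelihood to be defined, which is one concrete reason the stepsize and starting point matter.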
