# ILLINOIS CS 446 - 091217.2 (8 pages)


## 091217.2


- Pages: 8
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS 446 Machine Learning, Fall 2017
Lecture 5: Optimization, Convexity, Gradient Descent
Lecturer: Sanmi Koyejo. Scribe: Ally Kaminsky. September 12, 2017.

## Recap of Previous Lecture: Overfitting and Underfitting

**Overfitting.** A model $h_n$ overfits when it is too complex, meaning the hypothesis class $\mathcal{H}$ is too big or too flexible (i.e., has too many parameters). Overfitting causes a model to perform well on training data but poorly on testing data.

**Underfitting.** A model underfits when its hypothesis class $\mathcal{H}$ is too small, meaning the functional form has too few parameters. This case is more difficult to identify than overfitting, because a model's performance on its training data is almost always better than its performance on testing data.

**Bias-Variance Tradeoff.** The bias of an estimator is
$$\mathrm{Bias}(h_n) = \mathbb{E}[h_n] - \theta,$$
where $h_n$ is the estimator on some observed data and $\theta$ is an unknown fixed constant. The variance of an estimator $h_n$ is
$$\mathrm{Variance}(h_n) = \mathbb{E}\big[(\mathbb{E}[h_n] - h_n)^2\big].$$
Bias and variance help us understand the error tradeoffs in a model, and this applies to all supervised learning tasks. The expected risk of a model can be decomposed as
$$\mathbb{E}[R(h_n; D_n)] = \mathrm{Bias}^2 + \mathrm{Variance},$$
where $h_n$ is an estimator on some observed dataset $D_n$.

## Selecting and Training Models

There are several ways to select and train models:

- **Engineer an algorithm** that works well. This is not a reasonable systematic approach, but it sometimes works well in practice.
- **Empirical Risk Minimization (ERM).** Find a model $h_n$ such that
  $$h_n = \operatorname*{argmin}_{h \in \mathcal{H}} R(h; D_n),$$
  where $\mathcal{H}$ is the hypothesis class of the model and $R(h; D_n)$ is the risk of a model $h$ given dataset $D_n$. The model produced by ERM is closely tied to the chosen risk function.
- **Probabilistic model approach.** Find a good approximation $\hat{P} \approx P$, then find a model $h_n$ such that
  $$h_n = \operatorname*{argmin}_{h \in \mathcal{F}} R(h; \hat{P}),$$
  where $\mathcal{F}$ is the class of all possible prediction functions and $R(h; \hat{P})$ is the risk of a model $h$ given $\hat{P}$. Benefits of the
probabilistic model approach: you do not need to start from scratch for each risk function, because $\hat{P}$ has already been found, so probabilistic models can be reused; and probabilistic models can incorporate domain knowledge.

## Fitting a Model

**Motivating example.** The goal of the following is to find a model that explains some data well. Suppose $\mathcal{X} = \{0, 1\}$ and the dataset is $D_n = \{x_1, x_2, \ldots, x_n\}$. Assume $D_n$ is sampled independent and identically distributed (iid) from the Bernoulli distribution $B(\theta)$ for some parameter $\theta$. This means that we sample $X \sim P(x \mid \theta)$ as $1$ with probability $\theta$ and $0$ with probability $1 - \theta$. That is,
$$P(X = x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}.$$
The conditional probabilities that $x = 0$ and $x = 1$ given $\theta$ are
$$P(x = 0 \mid \theta) = 1 - \theta, \qquad P(x = 1 \mid \theta) = \theta.$$

## Methods for Fitting a Model

To fit a model to the data from the example above, we need to find the parameter $\theta$. This can be done in several ways, including:

- Maximum Likelihood Estimation (MLE)
- Maximum a posteriori (MAP)
- Bayesian modeling

**Maximum Likelihood Estimation.** MLE fits a model to observed data by finding the parameter values that maximize the likelihood of seeing the observed data given those parameters. The likelihood of an observed dataset given parameters $\theta$ is $P(D_n \mid \theta)$; to maximize it, we want to find
$$\hat{\theta} = \operatorname*{argmax}_{\theta} P(D_n \mid \theta).$$
The log-likelihood function $L(\theta) = \log P(D_n \mid \theta)$ is often used instead, because its maximizer coincides with the likelihood's and it makes calculations easier. Since maximizing $L$ is equivalent to minimizing $-L$, the end goal of maximum likelihood estimation can be stated as finding $\operatorname*{argmin}_{\theta} \, {-L(\theta)}$.

**Bernoulli case example.** Continuing the previous example, the probability of a single data observation $x_i$ given parameter $\theta$ is
$$P(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{1 - x_i}.$$
This can be thought of as: for a fixed $\theta$, how likely is it to observe a $1$ or a $0$? We want the probability of a whole dataset of observations rather than a single data observation.
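The single-observation probability is easy to state in code. A minimal sketch in Python (the function name `bernoulli_pmf` is my own choice, not from the lecture):

```python
def bernoulli_pmf(x, theta):
    """P(X = x | theta) = theta^x * (1 - theta)^(1 - x), for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

# For a fixed theta, the two cases recover the conditional probabilities:
# bernoulli_pmf(1, theta) == theta and bernoulli_pmf(0, theta) == 1 - theta.
p1 = bernoulli_pmf(1, 0.7)   # probability of observing 1 when theta = 0.7
p0 = bernoulli_pmf(0, 0.7)   # probability of observing 0 when theta = 0.7
```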
Because the data are sampled iid, the dataset probability factorizes:
$$P(D_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta).$$
The log-likelihood function for the iid case is therefore
$$L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta).$$
Substituting the Bernoulli $P(x_i \mid \theta)$ into the log-likelihood gives the Bernoulli log-likelihood
$$L(\theta) = \sum_{i=1}^{n} \big[ x_i \log \theta + (1 - x_i) \log(1 - \theta) \big],$$
where $n$ is the number of observations in the dataset, $x_i$ is the $i$th observation, and $\theta$ is the parameter.

## Optimization Methods

To find the best parameter $\theta$ to fit a model to some observed data, we need to find $\operatorname*{argmin}_{\theta} \, {-L(\theta)}$. This is an optimization problem, so we can use various optimization methods to find $\hat{\theta}$, including:

**Solving directly.** Solve $\nabla L(\theta) = 0$. This method uses calculus to find the critical points of the log-likelihood function, which are candidates for the optimal $\theta$.

**Gradient descent.** Gradient descent can be used to find locally optimal solutions. The gradient of a function is its rate of change; the gradient of a function $f$ is written $\nabla f$, and at a function's critical points the gradient is $0$. Gradient descent is a sequential algorithm for finding local optima: it takes steps along the negative gradient to update the parameter being optimized, until it converges to an optimal or near-optimal solution. The parameter update rule for finding $\min_{\theta} f(\theta)$ is
$$\theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t),$$
where $t$ is the number of iterations the algorithm has completed, $\eta_t$ is the stepsize (how far the algorithm moves along the negative gradient at iteration $t$), and $\theta$ is the parameter being optimized. To use gradient descent to find the optimal $\theta$ for maximum likelihood estimation, apply the same rule to $-L$:
$$\theta_{t+1} = \theta_t - \eta_t \nabla\big({-L(\theta_t)}\big) = \theta_t + \eta_t \nabla L(\theta_t),$$
where $L(\theta)$ is the log-likelihood function for the model.
We must find the gradient of the log-likelihood function with respect to the parameter $\theta$ and use it in the update step of gradient descent; this works because $\nabla(-f) = -\nabla f$, so maximizing $L$ and minimizing $-L$ use the same gradient information.

Some points to consider regarding gradient descent:

- Gradient descent finds *locally* optimal solutions, so the starting point of the algorithm matters: different starting points may yield different solutions.
- The function $f$ that gradient descent is performed on should be continuous and differentiable.
- Gradient descent can overshoot the optimum if the stepsize $\eta_t$ is too large.
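The two optimization methods above can be sketched end to end on the Bernoulli example: gradient descent applied to the negative log-likelihood converges to the same $\hat{\theta}$ that solving $\nabla L(\theta) = 0$ gives in closed form, namely the empirical frequency of ones. This is a minimal illustration, not the lecture's implementation; the function names, toy dataset, stepsize, and iteration count are my own choices:

```python
def gradient_descent(grad, theta0, stepsize=0.01, num_iters=2000):
    """Iterate theta_{t+1} = theta_t - stepsize * grad(theta_t)."""
    theta = theta0
    for _ in range(num_iters):
        theta = theta - stepsize * grad(theta)
    return theta

def neg_log_likelihood_grad(theta, data):
    """Gradient of -L(theta) = -sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]."""
    n_ones = sum(data)
    n_zeros = len(data) - n_ones
    return -n_ones / theta + n_zeros / (1 - theta)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # toy dataset: 7 ones in 10 draws

# Gradient descent on the negative log-likelihood, started from theta = 0.5.
theta_gd = gradient_descent(lambda t: neg_log_likelihood_grad(t, data), theta0=0.5)

# Solving grad L(theta) = 0 directly yields the empirical frequency of ones.
theta_closed_form = sum(data) / len(data)
```

A constant stepsize is used here for simplicity; the $\eta_t$ notation in the update rule allows the stepsize to vary across iterations. Note also that the iterate must stay inside $(0, 1)$ for the Bernoulli log-likelihood to be defined, which is one concrete reason the stepsize and starting point matter.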
