CS446: Machine Learning, Fall 2017

Lecture 6: Penalized maximum likelihood: Sparsity: Forward / Backward selection

Lecturer: Sanmi Koyejo    Scribe: Yuqi Zhang, Sep. 21st, 2017

Recap

Naive Bayes

The basic idea is

$$P(x \mid y) = \prod_{i=1}^{N} P(x_i \mid y)$$

Once we have the function above, we can compute $P(y \mid x)$, which is proportional to $P(x \mid y)\,P(y)$.

Maximum likelihood

The idea is that we want to find the parameter $\theta$ that best explains the dataset $D_n$, i.e., that maximizes $P(D_n \mid \theta)$. Equivalently, we work with the log likelihood

$$l(\theta) = \log P(D_n \mid \theta)$$

One of the following two major methods can be used to compute $\operatorname{argmax}_{\theta} l(\theta)$:

- Set the derivative to zero and solve:
$$\frac{dl(\theta)}{d\theta} = 0$$

- Gradient descent: initialize $\theta_0$, then recursively update
$$\theta_{t+1} = \theta_t - \gamma_t \left.\frac{d(-l(\theta))}{d\theta}\right|_{\theta = \theta_t}$$

Linear Regression

We model $P(y \mid x) = \mathcal{N}(m(x), \sigma^2)$ with $m(x) = w^T x$, where $w \in \mathbb{R}^d$ and $\sigma^2 \in \mathbb{R}_+$. Therefore $\theta = \{w, \sigma^2\}$.

The log likelihood function is

$$\begin{aligned}
l(\theta) &= \sum_{i=1}^{n} \log\left[\left(\frac{1}{2\pi\sigma^2}\right)^{1/2} \exp\left(-\frac{1}{2\sigma^2}(y_i - w^T x_i)^2\right)\right] \\
&= \sum_{i=1}^{n} \left[-\frac{1}{2\sigma^2}(y_i - w^T x_i)^2\right] - \frac{n}{2}\log(2\pi\sigma^2) \\
&= -\frac{1}{2\sigma^2}\,\mathrm{RSS}(w) - \frac{n}{2}\log(2\pi\sigma^2)
\end{aligned}$$

Here RSS is the residual sum of squares, i.e., the sum of the squared residuals (deviations of the predicted values from the actual empirical values of the data). It is a measure of the discrepancy between the data and the estimation model.

Minimizing the negative log likelihood decouples:

$$\min_{w,\sigma^2} -l(\theta) = \min_{\sigma^2}\left[\frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\min_{w}\mathrm{RSS}(w)\right]$$

$$\min_{w}\mathrm{RSS}(w) = \min_{w}\sum_{i=1}^{n}(y_i - w^T x_i)^2, \qquad y_i \in \mathbb{R},\quad w, x_i \in \mathbb{R}^d$$

We rewrite this by stacking the data into

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^{n}, \qquad X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times d}$$

so that

$$\begin{aligned}
\mathrm{RSS}(w) &= \frac{1}{2}\,\|y - Xw\|_2^2 = \frac{1}{2}(y - Xw)^T(y - Xw) \\
&= \frac{1}{2}\,w^T(X^T X)\,w - w^T(X^T y) + \mathrm{const}
\end{aligned}$$

Setting the gradient to zero:

$$\frac{d\,\mathrm{RSS}(w)}{dw} = X^T X w - X^T y = 0 \;\;\Rightarrow\;\; X^T X w = X^T y \;\;\Rightarrow\;\; w = (X^T X)^{-1} X^T y$$

Penalized Model

It is not always a good idea to take the data and fit the model directly.

Failure case of least squares: in

$$w = (X^T X)^{-1} X^T y$$

if $X^T X$ is singular, then the inverse fails; and if $n < d$, then $X^T X$ is always singular.

Probabilistic view

The goal is to include a prior distribution on the model parameters. A common prior is Gaussian:

$$p(w) = \mathcal{N}(0, \lambda^2 I)$$

MAP (maximum a posteriori): since

$$\max_{\theta} p(\theta \mid D_n) = \max_{\theta} \frac{p(D_n \mid \theta)\,p(\theta)}{p(D_n)}$$

and the evidence $p(D_n)$ does not depend on $\theta$, we have

$$\hat{\theta} = \operatorname{argmax}_{\theta}\; p(D_n \mid \theta)\,p(\theta)$$

Taking logs for the linear model with the Gaussian prior on $w$:

$$g(\theta) = -\frac{1}{2\sigma^2}\,\mathrm{RSS}(w) - \frac{n}{2}\log(2\pi\sigma^2) - \frac{\|w\|_2^2}{2\lambda^2} - \frac{1}{2}\log(2\pi\lambda^2)$$

Dropping the terms that do not depend on $w$ and rescaling by $2\sigma^2$, maximizing $g(\theta)$ over $w$ is equivalent to minimizing

$$\|y - Xw\|_2^2 + \frac{\sigma^2}{\lambda^2}\,\|w\|_2^2$$
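As a concrete check on the derivation above, here is a minimal NumPy sketch; the data, sizes, and hyperparameter values are made up for illustration, not from the lecture. Minimizing the penalized objective $\|y - Xw\|_2^2 + \frac{\sigma^2}{\lambda^2}\|w\|_2^2$ in closed form gives $w = (X^T X + \frac{\sigma^2}{\lambda^2} I)^{-1} X^T y$, and the added diagonal term makes the system solvable even in the $n < d$ failure case where plain least squares breaks.

```python
import numpy as np

# Toy data (made up for illustration): n = 5 samples, d = 8 features,
# so n < d and X^T X is guaranteed to be singular.
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ordinary least squares w = (X^T X)^{-1} X^T y fails here because
# X^T X (an 8x8 matrix of rank at most 5) is singular.
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # prints 5, not 8 -> singular

# MAP / penalized estimate: w = (X^T X + (sigma^2 / lambda^2) I)^{-1} X^T y.
# sigma2 and lam2 are assumed values for the noise and prior variances.
sigma2, lam2 = 1.0, 10.0
alpha = sigma2 / lam2
w_map = np.linalg.solve(XtX + alpha * np.eye(d), X.T @ y)
print(w_map)
```

Since $X^T X$ is positive semidefinite, adding any positive multiple of $I$ makes it positive definite, so the solve always succeeds; this is exactly how the Gaussian prior repairs the singularity.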

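The same least-squares problem can also be solved iteratively with the gradient-descent update from the Maximum likelihood section, $\theta_{t+1} = \theta_t - \gamma_t\, d(-l(\theta))/d\theta$. Below is a sketch under stated assumptions: $\sigma^2$ is held fixed so only $w$ is updated, and the step size $\gamma_t$ is a constant (the lecture leaves the schedule unspecified); the data are synthetic.

```python
import numpy as np

# Gradient descent on the negative log likelihood of the linear model.
# With sigma^2 fixed, d(-l)/dw = (1 / sigma^2) X^T (X w - y).
rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])       # made-up ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

sigma2 = 1.0
gamma = 0.001          # constant step size (assumption)
w = np.zeros(d)        # theta_0

for t in range(2000):
    grad = X.T @ (X @ w - y) / sigma2     # gradient of -l at w
    w = w - gamma * grad                  # descent update

print(w)                                  # close to w_true
print(np.linalg.solve(X.T @ X, X.T @ y))  # closed-form w for comparison
```

Both routes reach the same minimizer; the iterative version matters when $d$ is large enough that forming and inverting $X^T X$ is expensive.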

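Finally, returning to the Naive Bayes recap: because $P(y \mid x) \propto P(x \mid y)\,P(y)$, classification only needs the factorized likelihood and the prior; the evidence $P(x)$ is recovered by normalizing. A minimal sketch assuming binary features with Bernoulli class-conditionals (a modeling choice, with probability tables invented for illustration):

```python
import numpy as np

# Naive Bayes scoring: P(y | x) is proportional to P(x | y) P(y), where
# P(x | y) factorizes over features. The tables below are made up.
prior = np.array([0.6, 0.4])            # P(y) for classes 0 and 1
theta = np.array([[0.1, 0.7, 0.5],      # P(x_i = 1 | y = 0), per feature
                  [0.8, 0.2, 0.5]])     # P(x_i = 1 | y = 1), per feature

x = np.array([1, 0, 1])                 # one binary observation

# P(x | y) = prod_i P(x_i | y) under the naive independence assumption
likelihood = np.prod(theta**x * (1 - theta)**(1 - x), axis=1)

# Normalize the joint to obtain P(y | x)
joint = likelihood * prior
posterior = joint / joint.sum()
print(posterior)
```

In practice these products are computed as sums of logs to avoid numerical underflow when the number of features $N$ is large.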