CS446: Machine Learning, Fall 2017
Lecture 12: Gaussian Process Regression
Lecturer: Sanmi Koyejo    Scribe: Gohar Irfan Chaudhry, Oct. 5th, 2017

Recap

Kernel Function

A kernel measures the similarity between $x_i$ and $x_j$:

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

We can derive a kernel by feature matching as follows:

$$k(x_i, x_j) = \Phi^T(x_i)\,\Phi(x_j)$$

where $\Phi : \mathbb{R}^d \to \mathbb{R}^m$, usually with $m \gg d$. Anything that can be written in this way is called a Mercer kernel.

Given data $D_n = \langle x_1, \ldots, x_n \rangle$, the kernel (Gram) matrix is

$$\kappa = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}$$

We observe that $\kappa$ is positive semi-definite (meaning that its eigenvalues are nonnegative). An example is the Gaussian kernel (also called the Radial Basis Function or RBF kernel), one of the most popular kernels in practice:

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right) = \exp\left(-\gamma \|x_i - x_j\|_2^2\right), \quad \text{where } \gamma = \frac{1}{2\sigma^2}$$

Here $\sigma$ is a hyperparameter (known as the bandwidth parameter) that is generally tuned using cross validation. When its value is small, the kernel is much more sensitive to nearby points. Expanding the exponential shows the implicit feature matching:

$$\exp\left(-\tfrac{1}{2}\|x_i - x_j\|_2^2\right) = \sum_{w=0}^{\infty} \frac{(x_i^T x_j)^w}{w!} \cdot \exp\left(-\tfrac{1}{2}\|x_i\|_2^2\right) \cdot \exp\left(-\tfrac{1}{2}\|x_j\|_2^2\right)$$

Kernels are convenient when doing complex feature matching; the corresponding features take the form

$$\phi_w(x_i) = \exp\left(-\tfrac{1}{2}\|x_i\|_2^2\right) \cdot \eta_w$$

where $\eta_w$ collects the (scaled) degree-$w$ polynomial terms of $x_i$ from the series above.

Ridge Regression

$$\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$$

Stack the data as rows of

$$X = \begin{bmatrix} \leftarrow x_1^T \rightarrow \\ \leftarrow x_2^T \rightarrow \\ \vdots \\ \leftarrow x_n^T \rightarrow \end{bmatrix}$$

an $(n \times d)$ matrix. The solution is

$$w^* = (X^T X + \lambda I)^{-1} X^T y$$

Fact to note:

$$(I + AB)^{-1} A = A (I + BA)^{-1}$$

Applying this identity,

$$w = X^T (\lambda I + X X^T)^{-1} y = X^T \alpha = \sum_{i=1}^{n} \alpha_i x_i$$

where

$$\alpha = (\lambda I + X X^T)^{-1} y$$

So $w$ is in the row space of $X$, and we can write $w$ as a weighted sum of the $x_i$.

Two Solutions:

First: $w^* = (X^T X + \lambda I)^{-1} X^T y$. The dominant computational cost is the inverse of a $d \times d$ matrix, so $O(d^3)$.

Second: $\alpha = (X X^T + \lambda I)^{-1} y$, then $w = X^T \alpha$. The cost is $O(n^3)$.

So we use the first solution when $d < n$ (lots of samples and few dimensions), and the second solution when $d > n$ (lots of dimensions and few samples).

Plugging the solution for $w$ into the prediction for a new point $x_{n+1}$,

$$f(x_{n+1}) = w^T x_{n+1} = x_{n+1}^T w = x_{n+1}^T X^T \alpha = \sum_{i=1}^{n} (x_{n+1}^T x_i)\,\alpha_i$$

This gives us the prediction for a new $x$.

Now use feature mappings:

$$\min_w \sum_{i=1}^{n} \left(y_i - w^T \phi(x_i)\right)^2 + \lambda \|w\|_2^2, \qquad \Phi = \begin{bmatrix} \leftarrow \phi^T(x_1) \rightarrow \\ \leftarrow \phi^T(x_2) \rightarrow \\ \vdots \\ \leftarrow \phi^T(x_n) \rightarrow \end{bmatrix}$$

The first solution, as before:

$$w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$$

The second solution, as before:

$$\alpha = (\Phi \Phi^T + \lambda I)^{-1} y, \qquad w = \Phi^T \alpha$$

$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} \phi(x)^T \phi(x_i)\,\alpha_i$$

Note that

$$\Phi \Phi^T = \begin{bmatrix} \phi^T(x_1)\phi(x_1) & \cdots & \phi^T(x_1)\phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi^T(x_n)\phi(x_1) & \cdots & \phi^T(x_n)\phi(x_n) \end{bmatrix} = \kappa$$

Each entry of this matrix is a kernel evaluation, so we can rewrite everything in terms of the kernel:

$$\alpha = (\kappa + \lambda I)^{-1} y, \qquad f(x) = k_*^T \alpha = \sum_{i=1}^{n} k(x, x_i)\,\alpha_i, \qquad k_* = [k(x, x_1), \ldots, k(x, x_n)]^T$$

The original least squares problem was

$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2, \qquad f(x) = \sum_{i=1}^{n} k(x, x_i)\,\alpha_i$$

Here, note that $w = \Phi^T \alpha$, so

$$w^T w = \|w\|_2^2 = \alpha^T \Phi \Phi^T \alpha = \alpha^T \kappa\,\alpha$$

and the problem becomes

$$\min_\alpha \|y - \kappa \alpha\|_2^2 + \lambda\,\alpha^T \kappa\,\alpha, \qquad \alpha^* = (\kappa + \lambda I)^{-1} y$$

When $\phi(x) \in \mathbb{R}^m$ with $m$ large and $m \gg n$, this approach is more efficient. It gives us the flexibility to construct complex feature mappings $\phi$ using only kernels. Later in the course we will discuss how, at this stage, multiple $\phi_i(x)$ resemble a one-layer neural network.
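As a quick illustration of the dual (kernel) form derived above, here is a minimal NumPy sketch of RBF kernel ridge regression. The function names, synthetic data, and hyperparameter values are my own choices for illustration, not part of the lecture.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel between rows of A and rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
    """Dual solution: alpha = (kappa + lambda * I)^{-1} y."""
    kappa = rbf_kernel(X, X, sigma)
    return np.linalg.solve(kappa + lam * np.eye(X.shape[0]), y)

def predict_kernel_ridge(X_train, alpha, X_new, sigma=1.0):
    """Prediction: f(x) = sum_i k(x, x_i) * alpha_i."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# Tiny usage example on synthetic 1-D data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit_kernel_ridge(X, y, lam=0.1, sigma=0.5)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(predict_kernel_ridge(X, alpha, X_test, sigma=0.5))
```

Note that fitting requires only an $n \times n$ linear solve, matching the $O(n^3)$ cost of the second solution above.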
Support Vector Machine

This technique was originally developed for binary classification but can be extended to multi-class classification. As shown in Figure 1, there are multiple ways to perform the classification, since the boundary can be drawn in many different ways.

[Figure 1: Illustration of various margin sizes separating the two classes. Reference: Murphy (2012), page 500.]

We want to find the separating hyperplane that maximizes the margin:

$$f(x) = w^T x + w_0 \quad \text{such that} \quad \begin{cases} f(x) > 0, & \text{if } y = 1 \\ f(x) < 0, & \text{if } y = -1 \end{cases}$$

There are many possible solutions satisfying this constraint. The margin lines on either side of the classification boundary each lie at a distance $d$ from the boundary, so the total size of the margin is

$$d_+ + d_- = \frac{2}{\|w\|_2}, \qquad \text{where } d_+ = d_- = \frac{1}{\|w\|_2}$$

Maximizing the margin, we get

$$\max_w \frac{2}{\|w\|_2} \quad \text{such that} \quad y_i (w^T x_i + w_0) - 1 \ge 0 \quad \forall i \in D_n$$

This optimization problem is not easy to solve directly, so we solve an equivalent one:

$$\min_w \|w\|_2^2 \quad \text{such that} \quad y_i f(x_i) - 1 \ge 0 \quad \forall i \in D_n$$

Putting the problem into Lagrange multiplier form and solving,

$$\min_w \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left(y_i f(x_i) - 1\right), \qquad w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

where

$$\begin{cases} \alpha_i > 0, & \text{if } y_i f(x_i) = 1 \\ \alpha_i = 0, & \text{if } y_i f(x_i) > 1 \end{cases}$$

The points $x_i$ with $y_i f(x_i) = 1$ are known as support vectors; they fully describe the hyperplane. For predictions,

$$f(x_*) = w^T x_* = \sum_{i=1}^{n} \alpha_i y_i\, x_i^T x_*$$

Applying a kernel,

$$f(x_*) = \sum_{i=1}^{n} \alpha_i y_i\, k(x_i, x_*)$$

Most of the $\alpha_i$ are zero, which means you only need the support vectors: instead of all $n$ training points, predictions require only the points that are support vectors.

With feature mappings, the problem is

$$\min_w \|w\|_2^2 \quad \text{such that} \quad y_i \left(w^T \phi(x_i) + w_0\right) \ge 1$$

with solution

$$w^* = \sum_{i=1}^{n} \alpha_i y_i\, \phi(x_i)$$

where again most of the $\alpha_i$ are zero, and

$$f(x_*) = w^T \phi(x_*) = \sum_{i=1}^{n} \alpha_i y_i\, \phi(x_i)^T \phi(x_*) = \sum_{i=1}^{n} \alpha_i y_i\, k(x_i, x_*)$$
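To make the sparsity of the dual weights concrete, here is a small sketch using scikit-learn's SVC with an RBF kernel. Using scikit-learn is my own choice (the lecture does not prescribe a library), and the synthetic data and hyperparameter values are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, non-linearly-separable binary data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# RBF-kernel SVM; sklearn's gamma plays the role of gamma = 1/(2 sigma^2) above.
clf = SVC(kernel="rbf", gamma=0.5, C=10.0)
clf.fit(X, y)

# Only the support vectors have nonzero dual weights alpha_i, so the decision
# function sum_i alpha_i y_i k(x_i, x_*) + w_0 uses just these points.
print(f"support vectors: {len(clf.support_vectors_)} of {len(X)} training points")
print(clf.decision_function(X[:3]))   # signed margin values for a few points
print(clf.predict(X[:3]))             # predicted labels in {-1, +1}
```

Here `clf.dual_coef_` stores the products $\alpha_i y_i$ for the support vectors, matching the expansion $w^* = \sum_i \alpha_i y_i\, \phi(x_i)$ above.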


Bibliography

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
