CS446: Machine Learning, Fall 2017
Lecture 12: Gaussian Process Regression
Lecturer: Sanmi Koyejo          Scribe: Gohar Irfan Chaudhry, Oct. 5th, 2017

Recap

Kernel Function

A kernel measures the similarity between $x_i$ and $x_j$:
$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

We can derive a kernel by feature matching as follows:
$$k(x_i, x_j) = \Phi^T(x_i)\,\Phi(x_j)$$
where
$$\Phi : \mathbb{R}^d \to \mathbb{R}^m, \quad \text{usually } m \gg d.$$
Anything that can be written in this way is called a Mercer kernel.

Given a dataset $\mathcal{D}_n = \langle x_1, \dots, x_n \rangle$, the kernel (Gram) matrix is
$$\kappa = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \dots & k(x_1, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \dots & k(x_n, x_n) \end{bmatrix}$$

We observe that $\kappa$ is positive semi-definite (meaning that it has non-negative eigenvalues). An example is the Gaussian kernel (also called a Radial Basis Function or RBF kernel), one of the most popular kernels in practice:
$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right) = \exp\left(-\gamma\,\|x_i - x_j\|_2^2\right), \quad \text{where } \gamma = \frac{1}{2\sigma^2}.$$

$\sigma$ is a hyperparameter (known as a bandwidth parameter) that is generally tuned using cross validation. When its value is small, the kernel is much more sensitive to nearby points.

The RBF kernel corresponds to an infinite-dimensional feature expansion:
$$\exp\left(-\tfrac{1}{2}\|x_i - x_j\|_2^2\right) = \sum_{w=0}^{\infty} \frac{(x_i^T x_j)^w}{w!}\,\exp\left(-\tfrac{1}{2}\|x_i\|_2^2\right)\exp\left(-\tfrac{1}{2}\|x_j\|_2^2\right)$$

Kernels are convenient when doing complex feature matching:
$$\phi_w(x_i) = \exp\left(-\tfrac{1}{2}\|x_i\|_2^2\right)\eta_w$$

Ridge Regression

$$\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda\,\|w\|_2^2$$

$$X = \begin{bmatrix} \leftarrow x_1^T \rightarrow \\ \leftarrow x_2^T \rightarrow \\ \vdots \\ \leftarrow x_n^T \rightarrow \end{bmatrix}$$

$X$ is an $(n \times d)$ matrix.
$$w^* = (X^T X + \lambda I)^{-1} X^T y$$

Fact to note:
$$(I + AB)^{-1} A = A\,(I + BA)^{-1}$$

$$w = X^T (\lambda I + X X^T)^{-1} y = X^T \alpha = \sum_{i=1}^n \alpha_i x_i$$
where
$$\alpha = (\lambda I + X X^T)^{-1} y.$$

$w$ is in the row space of $X$, so we can write $w$ as a weighted sum of the data points.

Two Solutions:

First:
$$w^* = (X^T X + \lambda I)^{-1} X^T y$$
The dominant computational cost is the inverse, which is of size $d \times d$, so $O(d^3)$.

Second:
$$\alpha = (X X^T + \lambda I)^{-1} y, \qquad w = X^T \alpha$$
The cost is $O(n^3)$.

So we use the first solution when $d < n$ (many samples, few dimensions) and the second solution when $d > n$ (many dimensions, few samples).

$$f(x_{n+1}) = w^T x_{n+1} = x_{n+1}^T w$$
Plugging in the solution from above for $w$:
$$= x_{n+1}^T X^T \alpha = \sum_{i=1}^n (x_{n+1}^T x_i)\,\alpha_i$$
This gives us a prediction for a new point $x_{n+1}$.

Using feature mappings:
$$\min_w \sum_{i=1}^n (y_i - w^T \phi(x_i))^2 + \lambda\,\|w\|_2^2$$

$$\Phi = \begin{bmatrix} \leftarrow \phi^T(x_1) \rightarrow \\ \leftarrow \phi^T(x_2) \rightarrow \\ \vdots \\ \leftarrow \phi^T(x_n) \rightarrow \end{bmatrix}$$

The first solution, as before:
$$w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$$

The second solution, as before:
$$\alpha = (\Phi \Phi^T + \lambda I)^{-1} y, \qquad w = \Phi^T \alpha$$

$$f(x) = w^T \phi(x) = \sum_{i=1}^n \phi(x)^T \phi(x_i)\,\alpha_i$$

$$\Phi \Phi^T = \begin{bmatrix} \leftarrow \phi^T(x_1) \rightarrow \\ \vdots \\ \leftarrow \phi^T(x_n) \rightarrow \end{bmatrix} \begin{bmatrix} \uparrow & & \uparrow \\ \phi(x_1) & \dots & \phi(x_n) \\ \downarrow & & \downarrow \end{bmatrix} = \begin{bmatrix} \phi^T(x_1)\phi(x_1) & \dots & \phi^T(x_1)\phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi^T(x_n)\phi(x_1) & \dots & \phi^T(x_n)\phi(x_n) \end{bmatrix} = \kappa$$

Each entry of this matrix is a kernel evaluation. Rewriting the solution in terms of the kernel matrix:
$$\alpha = (\kappa + \lambda I)^{-1} y$$
$$f(x) = k_*^T \alpha = \sum_{i=1}^n k(x, x_i)\,\alpha_i, \qquad k_* = [k(x, x_1), \dots, k(x, x_n)]^T$$

The original least squares problem was as follows:
$$\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda\,\|f\|^2, \qquad f(x) = \sum_{i=1}^n k(x, x_i)\,\alpha_i$$

Here, note that
$$w = \Phi^T \alpha, \qquad w^T w = \|w\|_2^2 = \alpha^T \Phi \Phi^T \alpha = \alpha^T \kappa\, \alpha.$$

So,
$$\min_\alpha \|y - \kappa\alpha\|_2^2 + \lambda\,\alpha^T \kappa\,\alpha, \qquad \alpha^* = (\kappa + \lambda I)^{-1} y$$

When $\phi(x) \in \mathbb{R}^m$ and $m$ is large ($m \gg n$), this approach is more efficient. It gives us the flexibility to construct complex feature mappings $\phi$ using only kernels. Later in the course we will discuss how, at this stage, the features $\phi_i(x)$ look like a one-layer neural network.
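As a concrete illustration of the kernelized ridge regression recipe above, here is a minimal NumPy sketch (not part of the original lecture) that builds the RBF Gram matrix, solves $\alpha = (\kappa + \lambda I)^{-1} y$, and predicts with $f(x) = \sum_i k(x, x_i)\,\alpha_i$. The function names, hyperparameter values, and toy sine data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        - 2.0 * A @ B.T
        + np.sum(B**2, axis=1)[None, :]
    )
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam=1e-2, gamma=0.5):
    """Solve (K + lam * I) alpha = y on the training Gram matrix K."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(X.shape[0]), y)

def predict_kernel_ridge(X_train, alpha, X_new, gamma=0.5):
    """Predict f(x) = sum_i k(x, x_i) * alpha_i for each row of X_new."""
    K_star = rbf_kernel(X_new, X_train, gamma)  # shape (n_new, n_train)
    return K_star @ alpha

# Toy usage: noisy 1-D sine data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

alpha = fit_kernel_ridge(X, y, lam=0.1, gamma=0.5)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(predict_kernel_ridge(X, alpha, X_test, gamma=0.5))
```

Note that the sketch solves the linear system $(\kappa + \lambda I)\alpha = y$ rather than forming the inverse explicitly; both match the formula above, but solving the system is the standard numerically stable choice.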
Support Vector Machine

This technique was originally developed for binary classification but can be extended to multi-class classification. As illustrated in Figure 1, there are multiple ways to perform the classification, since the boundary between the two classes can be drawn in many different ways.

Figure 1: Illustration of various margin sizes separating the two classes. Reference: Murphy (2012), page 500.

We want the separating hyperplane that maximizes this margin:
$$f(x) = w^T x + w_0 \quad \text{such that} \quad \begin{cases} f(x) > 0, & \text{if } y = 1 \\ f(x) < 0, & \text{if } y = -1 \end{cases}$$
There are many possible solutions to the above constraint.

The margin lines on either side of the classification boundary each lie at a distance $d$ from the boundary. Thus, the total size of the margin can be written as
$$d_+ + d_- = \frac{2}{\|w\|_2}, \quad \text{where} \quad d_+ = d_- = \frac{1}{\|w\|_2}.$$

Maximizing the margin, we get:
$$\max_w \frac{2}{\|w\|_2} \quad \text{such that} \quad y_i(w^T x_i + w_0) - 1 \ge 0 \quad \forall i \in \mathcal{D}_n$$

This optimization problem is not easy to solve directly, so we solve an equivalent problem:
$$\min_w \|w\|_2^2 \quad \text{such that} \quad y_i f(x_i) - 1 \ge 0 \quad \forall i \in \mathcal{D}_n$$

Writing the problem in Lagrange multiplier form:
$$\min_w \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\,(y_i f(x_i) - 1)$$
$$w = \sum_{i=1}^n \alpha_i y_i x_i$$
where
$$\begin{cases} \alpha_i > 0, & \text{if } y_i f(x_i) = 1 \\ \alpha_i = 0, & \text{if } y_i f(x_i) > 1 \end{cases}$$

Points $x_i$ with $y_i f(x_i) = 1$ are known as support vectors. These support vectors fully describe the hyperplane. For predictions:
$$f(x_*) = w^T x_* = \sum_{i=1}^n \alpha_i y_i\, x_i^T x_*$$
Applying the kernel:
$$= \sum_{i=1}^n \alpha_i y_i\, k(x_i, x_*)$$

Most of the $\alpha_i$ are zero, which means you only need the support vectors: for making predictions we go from a number of points equal to $n$ (the size of the entire dataset) down to only the number of support vectors.

With feature mappings:
$$\min_w \|w\|_2^2 \quad \text{such that} \quad y_i(w^T \phi(x_i) + w_0) \ge 1$$

Solution:
$$w^* = \sum_{i=1}^n \alpha_i y_i\, \phi(x_i)$$
Most of the $\alpha_i$ are zero.
$$f(x_*) = w^T \phi(x_*) = \sum_{i=1}^n \alpha_i y_i\, \phi(x_i)^T \phi(x_*) = \sum_{i=1}^n \alpha_i y_i\, k(x_i, x_*)$$
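As a companion to the prediction formula above, here is a small NumPy sketch (not from the lecture) of the kernelized SVM decision function $\sum_i \alpha_i y_i\, k(x_i, x_*) + w_0$, assuming the dual coefficients $\alpha_i$ and the bias $w_0$ have already been obtained from some solver. The function names, the inclusion of the bias term, and the toy support vectors are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_vec(X_sv, x_star, gamma=0.5):
    """Vector of kernel values k(x_i, x_star) for each support vector x_i."""
    return np.exp(-gamma * np.sum((X_sv - x_star) ** 2, axis=1))

def svm_decision(X_sv, y_sv, alpha_sv, w0, x_star, gamma=0.5):
    """Kernel SVM decision value: sum_i alpha_i y_i k(x_i, x_*) + w0.

    Only the support vectors (alpha_i > 0) need to be stored, since all
    other training points contribute zero to the sum.
    """
    k_star = rbf_kernel_vec(X_sv, x_star, gamma)
    return np.dot(alpha_sv * y_sv, k_star) + w0

def svm_predict(X_sv, y_sv, alpha_sv, w0, x_star, gamma=0.5):
    """Predicted label in {-1, +1}: the sign of the decision value."""
    return np.sign(svm_decision(X_sv, y_sv, alpha_sv, w0, x_star, gamma))

# Illustrative values only: two support vectors in 2-D with made-up coefficients.
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.7, 0.7])
w0 = 0.0
print(svm_predict(X_sv, y_sv, alpha_sv, w0, np.array([0.8, 1.2])))
```

This mirrors the point made above: once training is done, prediction only touches the stored support vectors and their kernel evaluations with the new point.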
Bibliography

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.