# ILLINOIS CS 446 - 100517.1 (9 pages)

Previewing pages*1, 2, 3*of 9 page document

**View the full content.**## 100517.1

Previewing pages
*1, 2, 3*
of
actual document.

**View the full content.**View Full Document

## 100517.1

0 0 53 views

- Pages:
- 9
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 12 Gaussian Process Regression Lecturer Sanmi Koyejo Scribe Gohar Irfan Chaudhry Oct 5th 2017 Recap Kernel Function Similarity between xi and xj k xxx R We can derive kernel by feature matching as follows k xi xj T xi xj where Rd Rm usually m d Anything that can be written in this way is called a Mercer Kernel D1 x1 xn k x1 x1 k x1 x2 k x1 xn k xn x1 k xn x2 k xn xn We observe that is positive semi definite meaning that it has positive eigenvalues An example of this is Gaussian Kernel which is also called a Radial Basis Function or RBF Kernel and is one of the most popular kernels in practice 1 2 12 Gaussian Process Regression kxi xj k22 2 2 exp kxi xj k22 k xi xj exp where 1 2 2 is a hyperparameter known as a bandwidth parameter that is generally tuned using cross validation When its value is small you are much more sensitive to nearby points X xTi xj w 1 1 1 2 exp kxi xj k2 exp kxi k22 exp kxj k22 2 w 2 2 w 0 Kernels are convenient when doing complex feature matching 1 w xi exp kxi k22 w 2 Ridge Regression min x n X yi xTi w 2 kwk22 i 1 xT1 xT 2 X xTn X is an n x d matrix w X T X I 1 X T y Fact to note I AB 1 A A I BA 1 W X T I XX T 1 y n X T X i xi i 1 12 Gaussian Process Regression 3 where I XX T 1 y W is in the row space of X so we can write W as a weighted sum of X Two Solutions First w X T X I 1 X T y Dominant computation cost is the inverse which is of size d x d so O d3 Second XX T I 1 y w XT Cost is O n3 So we use the first solution when d n when lots of samples and few dimensions otherwise we use the second solution when d n when lots of dimensions and a few samples f xn 1 wT xn 1 xTn 1 w plugging in the solution from above for w xTn 1 X T n X xTn 1 xi i i 1 This gives us prediction for some new x 4 12 Gaussian Process Regression Use feature mappings n X yi wT T xi 2 kwk22 i 1 T x1 T x 2 T xn The first solution as before w T I 1 T y The second solution as before T I 1 y w T f x wT x n X xn 1 T xi i i 1 T x1 T x T 2 T x1 T xn T xn T x1 x1 T x1 xn T xn x1 T xn xn Each cell in the above matrix is a kernel Rewrite it as 12 Gaussian Process Regression 5 k I 1 y n X f x k T k x xi i i 1 k k x x1 k x xn T The original least square problem was as follows n X yi f xi 2 kf k2 i 1 f x n X k x xi i i 1 Here note that w xT wT w kwk22 T T 2 k So min ky k k22 T k k I 1 y when x Rm and m is large and m n then this approach is more efficient This gives us the flexibility to construct complex feature mappings using only kernels Later in the course we will discuss how at this stage multiple i x seem like a 1 layer neural network Support Vector Machine This technique was originally developed for binary classification but can be extended to multi class classification In Fig 1 there are multiple ways to perform the classification as the boundary can be drawn in many different ways 6 12 Gaussian Process Regression Figure 1 Illustration of various margin sizes to separate between the two classes Ref Murphy 2012 Page 500 We want to find points that maximize this margin f x wT x w0 such that f x 0 if y 1 f x 0 if y 1 There are many possible solutions to the above constraint The margin lines on either side of the classification boundary each has a distance d from the classification boundary Thus the total size of the margin can be written as d d 2 kwk2 d d 1 kwk2 where Maximizing the margin we get max w 2 kwk2 such that yi wT xi w0 1 0 i Dn 12 Gaussian Process Regression 7 This is an optimization problem and not easy to solve thus we solve a similar problem min kwk22 w such that yi f xi 1 0 i Dn Solving the problem and taking it into Le Grant Multiplier form n min w X 1 kwk2 i yi f xi 1 2 i 1 w n X i yi x i i 1 where 0 if yi f xi 1 0 if yi f xi 1 x such that f x 1 are known as support vectors These support vectors fully describe the hyperplane For predictions T f x w x n X i yi xTi x i 1 Applying kernel n X i yi k xi x i 1 Most of the i s are zero which means you only need support vectors We can go from number of points equal to N size of entire data to only points equal to the number of support vectors for making predictions min kwk22 w 8 12 Gaussian Process Regression such that yi xi w0 1 Solution w n X i 1 Most of the i s are zero P f x wT x ni 1 i yi xi T x P ni 1 i yi k xi x i yi xi Bibliography Murphy K P 2012 Machine Learning A Probabilistic Perspective The MIT Press 9

View Full Document