# ILLINOIS CS 446 - 100517.1 (9 pages)


- Pages: 9
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


**CS 446: Machine Learning, Fall 2017 — Lecture 12: Gaussian Process Regression**
Lecturer: Sanmi Koyejo. Scribe: Gohar Irfan Chaudhry. Oct 5th, 2017.

## Recap: Kernel Functions

A kernel function measures similarity between $x_i$ and $x_j$:

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}.$$

We can derive a kernel by feature mapping as follows:

$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j),$$

where $\phi : \mathbb{R}^d \to \mathbb{R}^m$, usually with $m \gg d$. Anything that can be written in this way is called a Mercer kernel.

Given data $D = \{x_1, \ldots, x_n\}$, the Gram matrix is

$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ \vdots & & & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}.$$

We observe that $K$ is positive semidefinite, meaning all of its eigenvalues are nonnegative.

An example is the Gaussian kernel, also called the Radial Basis Function (RBF) kernel, one of the most popular kernels in practice:

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right) = \exp\left(-\gamma \|x_i - x_j\|_2^2\right), \qquad \gamma = \frac{1}{2\sigma^2},$$

where $\sigma$ is a hyperparameter known as the bandwidth parameter, generally tuned using cross-validation. When its value is small, the kernel is much more sensitive to nearby points.

The RBF kernel corresponds to an infinite-dimensional feature map. Expanding the squared norm,

$$\exp\left(-\tfrac{1}{2}\|x_i - x_j\|_2^2\right) = \exp\left(-\tfrac{1}{2}\|x_i\|_2^2\right)\, \exp\left(-\tfrac{1}{2}\|x_j\|_2^2\right) \sum_{w=0}^{\infty} \frac{(x_i^T x_j)^w}{w!},$$

so the corresponding feature map has components

$$\phi_w(x_i) = \frac{1}{\sqrt{w!}}\, x_i^w \exp\left(-\tfrac{1}{2}\|x_i\|_2^2\right).$$

Kernels are convenient when doing complex feature matching.

## Ridge Regression

$$\min_{w} \sum_{i=1}^{n} (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$$

Stack the inputs into the $n \times d$ matrix

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}.$$

The closed-form solution is

$$w^* = (X^T X + \lambda I)^{-1} X^T y.$$

Fact to note: $(I + AB)^{-1} A = A (I + BA)^{-1}$. Applying this identity,

$$w^* = X^T (\lambda I + X X^T)^{-1} y = \sum_{i=1}^{n} \alpha_i x_i, \qquad \alpha = (\lambda I + X X^T)^{-1} y.$$

So $w^*$ lies in the row space of $X$, and we can write $w^*$ as a weighted sum of the rows of $X$.

**Two solutions.**

- First: $w^* = (X^T X + \lambda I)^{-1} X^T y$. The dominant computational cost is the inverse, which is of size $d \times d$, so $O(d^3)$.
- Second: $w^* = X^T (X X^T + \lambda I)^{-1} y$. The cost is $O(n^3)$.

We use the first solution when $d \ll n$ (lots of samples and few dimensions); otherwise we use the second solution, when $d \gg n$ (lots of dimensions and few samples).

**Prediction.** For a new point $x_{n+1}$, plugging in the solution from above for $w$:

$$f(x_{n+1}) = w^T x_{n+1} = x_{n+1}^T w = x_{n+1}^T X^T \alpha = \sum_{i=1}^{n} \alpha_i\, x_{n+1}^T x_i.$$

This gives us the prediction for some new $x$.

## Ridge Regression with Feature Mappings

$$\min_{w} \sum_{i=1}^{n} \left(y_i - w^T \phi(x_i)\right)^2 + \lambda \|w\|_2^2, \qquad \Phi = \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_n)^T \end{bmatrix}.$$

The first solution, as before: $w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$.

The second solution, as before: $w^* = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} y$.

Prediction:

$$f(x_{n+1}) = w^T \phi(x_{n+1}) = \sum_{i=1}^{n} \alpha_i\, \phi(x_{n+1})^T \phi(x_i) = \sum_{i=1}^{n} \alpha_i\, k(x_{n+1}, x_i),$$

and the $n \times n$ matrix appearing in the second solution is exactly the Gram matrix:

$$\Phi \Phi^T = \begin{bmatrix} \phi(x_1)^T \phi(x_1) & \cdots & \phi(x_1)^T \phi(x_n) \\ \vdots & & \vdots \\ \phi(x_n)^T \phi(x_1) & \cdots & \phi(x_n)^T \phi(x_n) \end{bmatrix} = K.$$
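The kernel recap above can be sketched numerically: build an RBF Gram matrix for some random points and check that it is positive semidefinite. This is a minimal illustration, not code from the lecture; the function names (`rbf_kernel`, `gram_matrix`) and the test data are my own.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian / RBF kernel: exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, sigma=1.0):
    """K[i, j] = k(x_i, x_j) for the rows x_i of X."""
    n = X.shape[0]
    return np.array([[rbf_kernel(X[i], X[j], sigma) for j in range(n)]
                     for i in range(n)])

rng = np.random.default_rng(0)  # illustrative data, not from the notes
X = rng.normal(size=(20, 3))
K = gram_matrix(X, sigma=0.5)

# Mercer kernel => the Gram matrix is positive semidefinite:
# every eigenvalue is >= 0 (up to floating-point error).
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10
```

Shrinking `sigma` drives the off-diagonal entries toward zero, which is the "more sensitive to nearby points" behavior described above.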
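The two ridge-regression solutions and the kernelized prediction can also be checked directly: the matrix identity $(I + AB)^{-1} A = A (I + BA)^{-1}$ implies the $d \times d$ and $n \times n$ solves give the same weights, and the prediction needs only inner products with the training points. A sketch under assumed synthetic data (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 4, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

# First solution: solve a d x d system, O(d^3). Prefer when d << n.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Second solution: solve an n x n system, O(n^3). Prefer when d >> n.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha  # w is a weighted sum of the training rows x_i

# The identity (I + AB)^{-1} A = A (I + BA)^{-1} makes the two agree.
assert np.allclose(w_primal, w_dual)

# Prediction for a new point uses only inner products x_{n+1}^T x_i;
# replacing them with k(x_{n+1}, x_i) gives kernel ridge regression.
x_new = rng.normal(size=d)
f_new = sum(alpha[i] * (x_new @ X[i]) for i in range(n))
assert np.isclose(f_new, x_new @ w_primal)
```

Because the prediction depends on the data only through inner products, swapping in a kernel evaluation (as in the feature-mapping section) never requires forming $\phi$ explicitly, even when it is infinite-dimensional.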
