# ILLINOIS CS 446 - 100317.2 (7 pages)

Previewing pages*1, 2*of 7 page document

**View the full content.**## 100317.2

Previewing pages
*1, 2*
of
actual document.

**View the full content.**View Full Document

## 100317.2

0 0 50 views

- Pages:
- 7
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 10 SVMs Kernel methods Lecturer Sanmi Koyejo Scribe Xingyu Xiang Oct 3th 2017 Agenda Recap Kernels Support Vector Machines SVM Output Presentation Regression y R Binary Classification tell apart two choices 1 y 1 mapping 2 1 1 to 0 1 and you can use 2y 1 mapping 0 1 to 1 1 both 1 1 and 0 1 are valid representation You can use Multiclass classification tell apart c choices 1 2 3 c c where 1 2 3 c is equivalent to 0 1 2 c 1 One hot encoding representation e g c 4 1 1 0 0 0 2 0 1 0 0 3 0 0 1 0 4 0 0 0 1 Pc c 0 1 c and i 1 yi 1 You can think about it in a more general way ei 0 0 0 1 0 0 where 1 only 1 2 10 SVMs Kernel methods exits in the i th position Or you can write it as mathmatical notation 0 1 c 4c 1 where x 4c 1 x R and P xi 0 xi 0 Multilabel Classification Problem Multilabel classification problem Pick a subset from c choices In a binary representtaion setting yi 0 1 0 1 1 2 4 5 where 0 1 0 1 1 represent if item i is in the set y c 1 2 3 4 c In a mathmatical way we write it as y powerset of c Binary Representation 0 1 c 2c Alternative in multiclass classification number of classes 2c 2c can be very big Recap Bagging also called Bootstrap aggregating Resample with replacement Train a model with each sample Prediction using average over the models Key Ideas Averaging reduces variance vs single predictor Minimal effect on bias Tends to reduce overall error This is an example we called ensemble algorithm Bias Variances using Darts For the upper left dart in Figure 1 we could observe High bias Average prediction is far from the balls eye Low variace 10 SVMs Kernel methods 3 Figure 1 Illustrate bias variance using darts dar For the bottom right dart in Figure 1 we could observe Low bias Average prediction is closer to balls eye High variace In class we are doing bais variance for risk functions In homework we are doing bais variance for parameters Surrogate Loss Functions Alternative loss function that achieves the same risk with large samples as optimizing the true loss risk function 0 1 Loss h sign f 1 yi f xi 0 Log Loss log 1 e yfi Exp Loss e yfi Figure 2 Illustration of various loss functions for binary classification The horizontal axis is the margin y the vertical axis is the loss The log loss uses log base 2 Figure generated by hingeLossPlot Murphy 2012 From Logistic Regression to Logloss 4 10 SVMs Kernel methods Logistic Regression p y x f x 1 y 1 1 f x 1 y 1 where f x wT x w0 logp y x log 1 y 1 1 y 1 1 1 1 e f xi 1 ef xi 1 y 1 log 1 e f i 1 y 1 log 1 ef i log 1 e yf x where fi f xi Feature Transformations Suppose we re interested in linear classification but my data looks like the left image in Figure 3 The problem is whether it is possible to find a linear classifier y sign wT x w0 to seperate the positive and negative data It is possible and the way is to transform the data x x1 x2 x21 x22 In the 3D space there will be a hyperplane to seperate the points 3 Figure 3 feature transformation Some cases useful for reducing bias of model e g Linear Model f wT x w0 wT x w0 left side liner in input features right side if is non linear linear in transformed features while non linear in input data 10 SVMs Kernel methods 5 Often used when x is not Euclidean e g Suppose we have 10 buildings 1 2 3 10 and we can choose the building It doesn t make sense to calculate 8 2 However we can use one hot encoding as feature transformation x one hot encoding 1 0 0 10 Many machine learning models can be written in terms of inner product xi xj i j e g Linear Regression xi T xj Similarity between points Kernel Trick K xi xj xi T xj We can derive the kernel function directly In our class we focus on kernels that are equivalent to inner products Known as Mercer kernels property of the kernel positive semi definite K x1 x1 K K xn x1 K x1 xn K xn xn where K is positive semi definite meaning eigenvalues of K 0 RBF kernels K xi xj exp xi xj 22 2 2 where 2 called bandwidth cool thing about RBF for RBF kernel are infinite dimensinal Z K xi xj w xi w xj dw w Problem is solved like x x xi T K xi xj Other kernels in Books Matern Kernels 14 2 5 String Kernels 14 2 6 Pyramid Match Kernels for images 14 2 7 6 10 SVMs Kernel methods Next Week Kernel Algorithms SVM s Regression Bibliography URL http conglang github io 2016 04 04 note a few useful things to know about machine l Murphy K P 2012 Machine learning a probabilistic perspective MIT press 7

View Full Document