# ILLINOIS CS 446 - 100317.3 (7 pages)

- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning

CS 446 Machine Learning, Fall 2017

## Lecture 11: SVMs, Kernel Methods

**Lecturer:** Sanmi Koyejo. **Scribe:** Yiting Pan. Oct 3rd, 2017.

## Recap: Output Representation

When making predictions, there are several ways to represent the targets we are trying to predict.

**Regression.** The output space is $y \in \mathbb{R}$.

**Binary classification.** The idea is to tell apart two choices (objects); the standard representations are $y \in \{-1, 1\}$ or $y \in \{0, 1\}$. The mapping from $\{0, 1\}$ to $\{-1, 1\}$ is $y \mapsto 2y - 1$, and the reverse mapping is $y \mapsto \frac{y + 1}{2}$.

**Multiclass classification.** We want to tell apart $c$ choices, so the representation is $y \in \{1, 2, 3, \dots, c\} = [c]$, which is equivalent to $\{0, 1, 2, \dots, c - 1\}$.

**One-hot encoding.** An example with $c = 4$:

| Class | One-hot representation |
|-------|------------------------|
| $1$ | $(1, 0, 0, 0)$ |
| $2$ | $(0, 1, 0, 0)$ |
| $3$ | $(0, 0, 1, 0)$ |
| $4$ | $(0, 0, 0, 1)$ |

In most cases, instead of $[c]$ we can use this representation, and either way we get the same answer. More generally, class $k$ is represented as

$$y = (0, \dots, 0, 1, 0, \dots, 0), \qquad \sum_{i=1}^{c} y_i = 1,$$

where the $1$ is in the $k$-th position and all other entries are $0$. Algebraically,

$$y \in \{0, 1\}^c \cap \Delta^{c-1},$$

where the triangle $\Delta$ denotes the simplex, the set of vectors that sum to one:

$$\Delta^{c-1} = \left\{ x \in \mathbb{R}^c : \sum_{i=1}^{c} x_i = 1, \; x_i \geq 0 \right\}.$$

**Multilabel classification.** Here the problem is to pick a subset of the $c$ choices. Suppose we predict $k$ items out of $c$; for example,

$$y = (0, 1, 0, 1, 1) \;\leftrightarrow\; \{2, 4, 5\},$$

where entry $i$ is $1$ if item $i$ is in the set: here we predict that items 2, 4, and 5 are associated with this example. Equivalently, $y$ can be any subset of entries of $[c] = \{1, 2, 3, \dots, c\}$; the mathematical statement that $y$ is a subset of $\{1, \dots, c\}$ is that $y$ belongs to the power set of $[c]$.

In the binary representation, $y \in \{0, 1\}^c$, which has $2^c$ elements. The alternative is multiclass classification with $2^c$ classes. Because the exponential grows very quickly, $2^c$ can get very big: in a document-labeling problem, for example, the number of potential topics can easily be a thousand, and a multiclass problem with $2^{1000}$ classes is infeasible to solve. But the binary case is feasible: we only need to solve a thousand binary classification problems.
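These representations are straightforward to realize in code. A minimal sketch, assuming NumPy; the function names are ours, not from the lecture:

```python
import numpy as np

def to_pm1(y):
    """Map a {0, 1} label to {-1, +1} via 2y - 1."""
    return 2 * y - 1

def to_01(y):
    """Map a {-1, +1} label back to {0, 1} via (y + 1) / 2."""
    return (y + 1) // 2

def one_hot(k, c):
    """One-hot encode class k (1-indexed) among c classes: the 1 sits in position k."""
    y = np.zeros(c, dtype=int)
    y[k - 1] = 1
    return y

def multilabel_vector(subset, c):
    """Binary indicator vector for a subset of [c] = {1, ..., c}."""
    y = np.zeros(c, dtype=int)
    for i in subset:
        y[i - 1] = 1
    return y
```

For example, `one_hot(3, 4)` gives `[0, 0, 1, 0]` (the table's third row), and `multilabel_vector({2, 4, 5}, 5)` gives `[0, 1, 0, 1, 1]` as in the multilabel example above.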
## Bagging (Bootstrap Aggregating)

- Resample with replacement.
- Train a model on each sample.
- Predict using the average over models.

Key ideas:

- Averaging reduces variance vs. a single predictor.
- Minimal effect on bias.
- Tends to reduce overall error.

Bagging is an example of an ensemble algorithm.

### Bias and Variance, Using Darts

In this example, a number of different players try to throw darts at the bull's eye. Each player's performance is illustrated below.

*Figure 1: Darts example.*

- Player 1: High bias (the average prediction is far from the bull's eye), small variance.
- Player 2: Low bias (the average is close to the bull's eye), high variance.

Bias is the difference between the average prediction we make and the true prediction. The ideal player would have both small bias and small variance. Our job is to look for functions, or function classes, that give the best tradeoff between being centered around the right answers and how noisy they are. In class we will do bias-variance analysis for risk functions; in the homework we will do it for parameters.

## Surrogate Loss Functions

A surrogate loss is an alternative loss function that, with large samples, achieves the same risk as optimizing the true loss (risk) function.

We gave the example of the 0-1 loss when the prediction is $h = \operatorname{sign}(f)$:

$$\ell_{0\text{-}1} = \mathbb{1}[y_i f(x_i) < 0].$$

"Less than $0$" means the sign of $f(x_i)$ and the sign of $y_i$ are different; if they are the same, the product is greater than $0$, which means we are making a good prediction. Some surrogate loss functions:

- Log loss: $\log\left(1 + e^{-y_i f(x_i)}\right)$
- Exponential loss: $e^{-y_i f(x_i)}$

*Figure 2: Various loss functions.*

### From Logistic Regression to Log Loss

Let $f(x) = w^\top x + w_0$ and

$$p(y \mid x) = \begin{cases} \dfrac{1}{1 + e^{-f(x)}} & y = 1, \\[2ex] \dfrac{1}{1 + e^{f(x)}} & y = -1. \end{cases}$$

Then

$$-\log p(y \mid x) = \mathbb{1}[y = 1] \log\left(1 + e^{-f(x)}\right) + \mathbb{1}[y = -1] \log\left(1 + e^{f(x)}\right) = \log\left(1 + e^{-y f(x)}\right).$$

## Feature Transformations

Linear classification: $y = \operatorname{sign}(w^\top x + w_0)$.

*Figure 3: Linear classification.*

In Figure 3 there is a circle that perfectly separates the two classes. Looking at how far away each point is from the center (the radius) is a way to get a perfect linear classifier for this problem.
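One can verify numerically that measuring the squared radius turns this circle example into a linearly separable problem. A minimal sketch on synthetic data, assuming NumPy; the weights are chosen by hand for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "circle" data: class +1 inside the unit circle, class -1 outside.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)

def transform(X):
    """Append the squared radius: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([X, (X**2).sum(axis=1)])

# In the lifted space, w = (0, 0, -1), w0 = 1 computes sign(1 - r^2):
# positive inside the circle, negative outside.
w = np.array([0.0, 0.0, -1.0])
w0 = 1.0
pred = np.sign(transform(X) @ w + w0)
print((pred == y).mean())  # prints 1.0: perfect separation by a hyperplane
```

The hyperplane thresholds the new third coordinate at $1$, which is exactly the "how far from the center" rule stated above.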
Transform the data:

$$x = (x_1, x_2) \;\mapsto\; (x_1, x_2, x_1^2 + x_2^2).$$

*Figure 4: Three-dimensional illustration; the green area is the separating hyperplane we are trying to find. The three coordinates are $x_1$, $x_2$, and $x_1^2 + x_2^2$. (Figure from an online resource.)*

In some cases this is useful for reducing the bias of a model, e.g. for a linear model:

$$f = w^\top x + w_0 \qquad \text{vs.} \qquad f = \tilde{w}^\top \phi(x) + \tilde{w}_0.$$

The left function is linear in the input features; the right function (assuming $\phi$ is nonlinear) is linear in the transformed features but nonlinear in the input features. Feature transformations are also often used when the input $x$ is not in a Euclidean space.

**Example.** Suppose we are trying to make a prediction using strings or choice data: we ask people, in a survey, which building they like most on campus. The output is a choice, which is not easily representable as a number. Suppose there are 10 buildings, $[10] = \{1, 2, \dots, 10\}$. It is meaningless to use the Euclidean distance between building 8 and building 2, which would be $8 - 2 = 6$. Instead, we can use one-hot encoding as the feature transformation $\phi(x)$. Now if we ask about the distance between building 8 and building 2, every pair of distinct buildings sits at the same unit distance, and the distance no longer depends on the arbitrary numbering.

Many machine learning models can be written, and solved, in terms of inner products, e.g. linear regression uses $x_i^\top x_j$ for pairs $i, j$. With a feature transformation, $\phi(x_i)^\top \phi(x_j)$ acts as a similarity between points.

## Kernel Trick

Instead of bothering to write down $\phi$, we can define a function and derive the kernel directly:

$$k(x_i, x_j) = \phi(x_i)^\top \phi(x_j).$$

We focus on kernels that are equivalent to inner products, known as Mercer kernels, which are positive semidefinite: for any collection of inputs $x_1, \dots, x_n$, the Gram matrix

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$

is PSD, i.e. the eigenvalues of $K$ are $\geq 0$.

### RBF Kernels

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right).$$

This is also called the Gaussian kernel, and the parameter $\sigma^2$ is known as the bandwidth. The $\phi$'s for the RBF kernel are infinite-dimensional, yet we can do computation with these infinite-dimensional objects in a tractable, principled way. The trick is to actually avoid …
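As a quick check of the Mercer/PSD property, one can build the RBF Gram matrix for an arbitrary set of points and confirm its eigenvalues are nonnegative. A small sketch with synthetic data, assuming NumPy:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma2=1.0):
    """Gaussian (RBF) kernel: exp(-||xi - xj||^2 / (2 * sigma^2))."""
    d = xi - xj
    return np.exp(-np.dot(d, d) / (2.0 * sigma2))

# Gram matrix for a random collection of inputs x_1, ..., x_n.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Mercer / PSD property: all eigenvalues are >= 0 (up to round-off).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # prints True
```

Note that `rbf_kernel` never materializes the infinite-dimensional $\phi$; it computes the inner product directly from the two inputs.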

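As a closing illustration, the bagging procedure recapped earlier (resample with replacement, train one model per sample, predict with the average) can be sketched in a few lines. The base model and synthetic data below are our own choices, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-d regression data.
n = 200
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)

# Bagging: resample with replacement and train one model per sample.
n_models = 50
models = []
for _ in range(n_models):
    idx = rng.integers(0, n, size=n)                   # bootstrap sample
    models.append(np.polyfit(x[idx], y[idx], deg=5))   # base model: degree-5 fit

# Predict using the average over models; averaging reduces variance
# relative to any single bootstrap model, with minimal effect on bias.
x_test = np.linspace(-3, 3, 100)
preds = np.stack([np.polyval(m, x_test) for m in models])
bagged = preds.mean(axis=0)
```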