CS446: Machine Learning, Fall 2017
Lecture 10: SVMs, Kernel Methods
Lecturer: Sanmi Koyejo                                  Scribe: Xingyu Xiang, Oct. 3rd, 2017

Agenda
• Recap
• Kernels
• Support Vector Machines (SVM)

Output Representation
• Regression: y ∈ R.
• Binary classification: tell apart two choices. Both {−1, 1} and {0, 1} are valid representations: the map (1/2)(y + 1) takes {−1, 1} to {0, 1}, and the map 2y − 1 takes {0, 1} back to {−1, 1}.
• Multiclass classification: tell apart c choices,
      {1, 2, 3, ..., c} = [c],
  where {1, 2, 3, ..., c} is equivalent to {0, 1, 2, ..., c − 1}.
• One-hot encoding representation, e.g. for c = 4:
      1 → [1, 0, 0, 0]
      2 → [0, 1, 0, 0]
      3 → [0, 0, 1, 0]
      4 → [0, 0, 0, 1]
  This maps [c] → {0, 1}^c with Σ_{i=1}^c y_i = 1. More generally, label i is encoded as e_i = [0, 0, ..., 1, ..., 0], where the 1 appears only in the i-th position. In mathematical notation, the set of one-hot vectors is {0, 1}^c ∩ Δ^{c−1}, where x ∈ Δ^{c−1} means x ∈ R^c with Σ_i x_i = 1 and x_i ≥ 0.

Multilabel Classification Problem
• Multilabel classification: pick a subset of the c choices.
• In a binary representation setting, for example,
      y_i = [0, 1, 0, 1, 1] ≡ {2, 4, 5},
  where the j-th entry indicates whether item j is in the set.
• Equivalently, y ⊆ [c] = {1, 2, ..., c}; written mathematically, y is an element of the power set of [c].
• Binary representation: |{0, 1}^c| = 2^c. Treated as a multiclass problem, the number of classes would be 2^c, which can be very large. (A small sketch of these label encodings follows this section.)
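To make the label representations above concrete, here is a minimal NumPy sketch of one-hot encoding for multiclass labels and the binary indicator encoding for multilabel subsets. The helper names and the small c = 4 / c = 5 examples are illustrative choices, not from the lecture.

```python
import numpy as np

def one_hot(label, c):
    """Map a multiclass label in {1, ..., c} to e_label in {0, 1}^c."""
    e = np.zeros(c, dtype=int)
    e[label - 1] = 1                 # the 1 appears only in the label-th position
    return e

def multilabel_indicator(subset, c):
    """Map a subset of [c] to its binary indicator vector in {0, 1}^c."""
    y = np.zeros(c, dtype=int)
    for j in subset:
        y[j - 1] = 1
    return y

print(one_hot(3, 4))                       # [0 0 1 0]; entries sum to 1 (on the simplex)
print(multilabel_indicator({2, 4, 5}, 5))  # [0 1 0 1 1], matching the example above
```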
Recap: Bagging (Bootstrap Aggregating)
• Procedure:
  – Resample the training data with replacement.
  – Train a model on each resample.
  – Predict using the average over the models.
• Key ideas:
  – Averaging reduces variance compared to a single predictor.
  – It has minimal effect on bias.
  – It therefore tends to reduce overall error.
  – Bagging is an example of what we call an "ensemble algorithm". (A minimal sketch of the procedure follows this section.)
• Bias/variance illustrated with darts (Figure 1):
  – Upper-left dartboard: high bias (the average prediction is far from the bullseye) and low variance.
  – Bottom-right dartboard: low bias (the average prediction is close to the bullseye) and high variance.
  – In class we study bias/variance of risk functions; in the homework we study bias/variance of parameters.

[Figure 1: Illustration of bias/variance using darts.]
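As a concrete illustration of the bagging procedure above, here is a minimal sketch: bootstrap resamples, a model fit per resample, and an averaged prediction. The least-squares base learner and the toy data are illustrative assumptions; the lecture describes the procedure generically (in practice the base learner is often a decision tree).

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares fit of y ~ X w (X already includes a bias column)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bagged_predict(X_train, y_train, X_test, n_models=25):
    """Bagging: resample with replacement, fit a model per resample, average predictions."""
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)      # bootstrap resample (with replacement)
        w = fit_linear(X_train[idx], y_train[idx])
        preds.append(X_test @ w)
    return np.mean(preds, axis=0)             # average over the ensemble

# Toy 1D regression problem (illustrative, not from the lecture).
X = np.column_stack([np.ones(50), np.linspace(-1, 1, 50)])
y = 2.0 * X[:, 1] + 0.3 * rng.standard_normal(50)
print(bagged_predict(X, y, X)[:5])
```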


Surrogate Loss Functions
• A surrogate loss is an alternative loss function that (with large samples) achieves the same risk as optimizing the true loss/risk function.
• 0/1 loss: with h = sign(f), the loss is 1[y_i f(x_i) < 0].
• Log loss: log(1 + e^{−y_i f_i}).
• Exp loss: e^{−y_i f_i}.

[Figure 2: Illustration of various loss functions for binary classification. The horizontal axis is the margin y f(x), the vertical axis is the loss. The log loss uses log base 2. Figure generated by hingeLossPlot; Murphy (2012).]

From Logistic Regression to Log Loss
• Logistic regression models
      p(y | x) = [σ(f(x))]^{1(y=1)} · [1 − σ(f(x))]^{1(y=−1)},
  where f(x) = w^T x + w_0 and σ(z) = 1/(1 + e^{−z}).
• Taking the negative log-likelihood and writing f_i = f(x_i),
      −log p(y_i | x_i) = −log[ (1/(1 + e^{−f_i}))^{1(y_i=1)} · (1/(1 + e^{f_i}))^{1(y_i=−1)} ]
                        = 1(y_i = 1) log(1 + e^{−f_i}) + 1(y_i = −1) log(1 + e^{f_i})
                        = log(1 + e^{−y_i f_i}),
  which is exactly the log loss. (A numerical check of this equivalence follows this section.)
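Here is a quick sanity-check sketch (not lecture material) verifying numerically that the two-branch negative log-likelihood above equals log(1 + e^{−y f}) for y ∈ {−1, 1}; the test values of y and f are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_two_branch(y, f):
    """Negative log-likelihood written with the two indicator branches."""
    return np.where(y == 1, np.log(1 + np.exp(-f)), np.log(1 + np.exp(f)))

def log_loss(y, f):
    """The compact form log(1 + exp(-y f))."""
    return np.log(1 + np.exp(-y * f))

y = np.array([1, -1, 1, -1])
f = np.array([2.3, 0.7, -1.1, -3.0])
assert np.allclose(nll_two_branch(y, f), log_loss(y, f))  # the two forms agree
assert np.allclose(nll_two_branch(y, f),
                   -np.log(np.where(y == 1, sigmoid(f), 1 - sigmoid(f))))
print(log_loss(y, f))
```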

Feature Transformations
• Suppose we are interested in linear classification, but the data looks like the left image in Figure 3, where the classes are not linearly separable. Is it possible to find a linear classifier y = sign(w^T x + w_0) that separates the positive and negative data?
• It is possible, and the way to do it is to transform the data:
      φ(x) = [x_1, x_2, x_1² + x_2²].
  In the 3D feature space there is a hyperplane that separates the points (Figure 3); see the sketch after this section.

[Figure 3: Feature transformation.]

• Feature transformations can be useful for reducing the bias of a model, e.g. for a linear model,
      f = w^T x + w_0  →  w^T φ(x) + w_0.
  The left side is linear in the input features; on the right side, if φ is non-linear, the model is linear in the transformed features while non-linear in the input data.
• They are often used when x is not Euclidean. e.g. suppose we have 10 buildings, labeled [1, 2, ..., 10], and each data point records which building was chosen. It does not make sense to compute 8 − 2 on these labels, but we can use one-hot encoding as the feature transformation:
      φ(x) = one-hot encoding, e.g. 1 → [1, 0, ..., 0], ..., 10 → [0, ..., 0, 1].
• Many machine learning models can be written in terms of inner products {⟨x_i, x_j⟩}_{i,j}, e.g. linear regression. In the transformed space,
      φ(x_i)^T φ(x_j)  ← a measure of similarity between points.
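Here is a minimal sketch of the feature map φ(x) = [x_1, x_2, x_1² + x_2²] discussed above, applied to synthetic circular data. The data generation and the fixed separating hyperplane are illustrative assumptions; the lecture specifies only the feature map.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(X):
    """Feature map phi(x) = [x1, x2, x1^2 + x2^2]."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

# Synthetic data: positives inside a circle of radius ~1, negatives in a ring outside it.
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.concatenate([rng.uniform(0.0, 0.8, 100), rng.uniform(1.2, 2.0, 100)])
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = np.concatenate([np.ones(100), -np.ones(100)])

# No line separates the classes in 2D, but in the transformed 3D space the plane
# z = 1 does, i.e. sign(w^T phi(x) + w0) with w = [0, 0, -1], w0 = 1.
Z = phi(X)
pred = np.sign(-Z[:, 2] + 1.0)
print("accuracy with the fixed hyperplane:", np.mean(pred == y))
```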

The "Kernel Trick"
• Define the kernel function
      K(x_i, x_j) = φ(x_i)^T φ(x_j).
  We can often compute the kernel function directly, without constructing φ explicitly.
• In this class we focus on kernels that are equivalent to inner products, known as Mercer kernels.
• The key property of such a kernel is that the n × n kernel (Gram) matrix K, with entries K_{ij} = K(x_i, x_j), is positive semi-definite, meaning all eigenvalues of K are ≥ 0.
• RBF kernels:
      K(x_i, x_j) = exp(−‖x_i − x_j‖²₂ / (2σ²)),
  where σ² is called the bandwidth (see the sketch after this list).
• A cool thing about the RBF kernel: its feature map φ is infinite dimensional!
      K(x_i, x_j) = ∫ φ_w(x_i) φ_w(x_j) dw.
• The problem is solved as: x → φ(x) → φ(x_i)^T φ(x_j) = K(x_i, x_j).
• Other kernels in the book (Murphy, 2012):
  – Matern kernels (Section 14.2.5)
  – String kernels (Section 14.2.6)
  – Pyramid match kernels, for images (Section 14.2.7)
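To connect the RBF kernel with the positive semi-definiteness property above, here is a small sketch that builds an RBF Gram matrix on random data and checks that its eigenvalues are non-negative (up to numerical round-off). The data, dimensionality, and bandwidth value are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_gram(X, sigma2=0.5):
    """Gram matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma2))

X = rng.standard_normal((20, 3))       # 20 points in R^3
K = rbf_gram(X)

eigvals = np.linalg.eigvalsh(K)        # eigenvalues of the symmetric Gram matrix
print("smallest eigenvalue:", eigvals.min())
assert eigvals.min() > -1e-10          # positive semi-definite (Mercer kernel)
```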
Next Week
• Kernel algorithms
• SVMs
• Regression

Bibliography
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
http://conglang.github.io/2016/04/04/note-a-few-useful-things-to-know-about-machine-learning/