CMU CS 10701 - SVMs, Duality and the Kernel Trick (cont.)


©2005-2007 Carlos Guestrin

Slide 1: SVMs, Duality and the Kernel Trick (cont.)
Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 28th, 2007

Slide 2: SVMs reminder

Slide 3: Dual SVM formulation – the non-separable case

Slide 4: Reminder from last time: What if the data is not linearly separable?
- Use features of features of features of features…
- Feature space can get really large really quickly!

Slide 5: Dual formulation only depends on dot-products, not on w!

Slide 6: Dot-product of polynomials

Slide 7: Finally: the "kernel trick"! (see the Python sketch after slide 16)
- Never represent features explicitly
- Compute dot products in closed form
- Constant-time high-dimensional dot-products for many classes of features
- Very interesting theory – Reproducing Kernel Hilbert Spaces (not covered in detail in 10701/15781; more in 10702)

Slide 8: Common kernels
- Polynomials of degree d
- Polynomials of degree up to d
- Gaussian kernels
- Sigmoid

Slide 9: Overfitting?
- Huge feature space with kernels: what about overfitting?
- Maximizing the margin leads to a sparse set of support vectors
- Some interesting theory says that SVMs search for a simple hypothesis with a large margin
- Often robust to overfitting

Slide 10: What about at classification time?
- For a new input x, if we need to represent Φ(x), we are in trouble!
- Recall the classifier: sign(w·Φ(x) + b)
- Using kernels we are cool!

Slide 11: SVMs with kernels
- Choose a set of features and a kernel function
- Solve the dual problem to obtain the support-vector weights αi
- At classification time, compute f(x) = Σi αi yi K(xi, x) + b and classify as sign(f(x))

Slide 12: Remember kernel regression?
1. wi = exp(−D(xi, query)² / Kw²)
2. How to fit with the local points? Predict the weighted average of the outputs: predict = Σi wi yi / Σi wi

Slide 13: SVMs vs. kernel regression
- SVMs predict sign(Σi αi yi K(xi, x) + b)
- Kernel regression predicts Σi K(xi, x) yi / Σi K(xi, x) (or its sign, for classification)

Slide 14: SVMs vs. kernel regression: the differences
- SVMs: learn the weights αi (and the bandwidth); often a sparse solution
- KR: fixed "weights", learn the bandwidth; solution may not be sparse; much simpler to implement

Slide 15: What's the difference between SVMs and logistic regression?

                                            SVMs         Logistic regression
   Loss function                            Hinge loss   Log-loss
   High-dimensional features with kernels   Yes!         No

Slide 16: Kernels in logistic regression
- Define the weights in terms of the support vectors
- Derive a simple gradient descent rule on the αi
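To make slides 6–7 concrete, here is a minimal sketch in Python (my own illustration, not from the slides; `phi` and `poly_kernel` are hypothetical names). It checks that the degree-2 polynomial kernel K(x, z) = (x·z)² returns exactly the dot product of the explicit degree-2 feature maps, without ever materializing those features:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x = (x1, x2):
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z, d=2):
    # (x . z)^d: constant time, independent of the feature-space dimension
    return np.dot(x, z) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # 16.0, via explicit features
print(poly_kernel(x, z))       # 16.0, via the kernel alone
```

Similarly, a sketch of the two predictors compared on slides 11–14, assuming the dual has already been solved for the support-vector weights αi and offset b (again, the function names are mine, not the course's):

```python
def svm_predict(x_new, X_sv, y_sv, alpha, b, kernel):
    # Kernelized SVM classifier (slide 11):
    # classify as sign( sum_i alpha_i * y_i * K(x_i, x_new) + b )
    score = sum(a * y * kernel(xi, x_new)
                for a, y, xi in zip(alpha, y_sv, X_sv)) + b
    return np.sign(score)

def kernel_regression_predict(x_new, X, y, kernel):
    # Kernel regression (slide 12): fixed weights wi = K(xi, x_new),
    # predict the weighted average of the outputs.
    w = np.array([kernel(xi, x_new) for xi in X])
    return np.dot(w, y) / np.sum(w)
```

Slide 14's contrast is visible in the signatures: the SVM carries learned, typically sparse weights `alpha`, while kernel regression learns no weights at all, only the kernel's bandwidth.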
Slide 17: What's the difference between SVMs and logistic regression? (Revisited)

                                            SVMs         Logistic regression
   Loss function                            Hinge loss   Log-loss
   High-dimensional features with kernels   Yes!         Yes!
   Semantics of output                      "Margin"     Real probabilities
   Solution sparse?                         Often yes!   Almost always no!

Slide 18: What you need to know
- Dual SVM formulation and how it's derived (intuition)
- The kernel trick
- How to derive the polynomial kernel
- Common kernels
- Kernelized logistic regression
- Differences between SVMs and logistic regression

Slide 19: Announcements
- Class projects out next week

Slide 20: PAC-learning, VC Dimension and Margin-based Bounds
Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 28th, 2007

Slide 21: What now…
- We have explored many ways of learning from data
- But… how good is our classifier, really? How much data do I need to make it "good enough"?

Slide 22: A simple setting…
- Classification with m data points
- Finite number of possible hypotheses (e.g., decision trees of depth d)
- A learner finds a hypothesis h that is consistent with the training data – it gets zero error in training: error_train(h) = 0
- What is the probability that h has true error more than ε, i.e., error_true(h) ≥ ε?

Slide 23: How likely is a bad hypothesis to get m data points right?
- A hypothesis h that is consistent with the training data got m i.i.d. points right
- Prob. that an h with error_true(h) ≥ ε gets one data point right: at most 1 − ε
- Prob. that an h with error_true(h) ≥ ε gets m data points right: at most (1 − ε)^m

Slide 24: But there are many possible hypotheses that are consistent with the training data

Slide 25: How likely is the learner to pick a bad hypothesis?
- Prob. that an h with error_true(h) ≥ ε gets m data points right: at most (1 − ε)^m
- There are k hypotheses consistent with the data
- How likely is the learner to pick a bad one?

Slide 26: Union bound
- P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …

Slide 27: How likely is the learner to pick a bad hypothesis? (cont.)
- By the union bound over the k consistent hypotheses, the probability of picking a bad one is at most k(1 − ε)^m

Slide 28: Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: Let the hypothesis space H be finite and let the dataset D have m i.i.d. samples, 0 < ε < 1. Then for any learned hypothesis h that is consistent on the training data:
  P(error_true(h) > ε) ≤ |H| e^(−mε)

Slide 29: Using a PAC bound
- Typically, two use cases:
  1. Pick ε and δ; the bound gives you m
  2. Pick m and δ; the bound gives you ε
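To illustrate slide 29's two use cases: setting |H| e^(−mε) ≤ δ in the Haussler bound and solving gives m ≥ (1/ε)(ln|H| + ln(1/δ)) for use case 1, and ε ≥ (1/m)(ln|H| + ln(1/δ)) for use case 2. A minimal sketch (my own code, not from the slides):

```python
import math

def pac_sample_size(h_size, eps, delta):
    # Use case 1: pick epsilon and delta, get m.
    # From |H| * exp(-m * eps) <= delta:  m >= (ln|H| + ln(1/delta)) / eps
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def pac_error(h_size, m, delta):
    # Use case 2: pick m and delta, get epsilon.
    return (math.log(h_size) + math.log(1.0 / delta)) / m

# Example (invented numbers): a hypothesis space of size 2**100,
# eps = 0.1, delta = 0.05 -> about 724 samples suffice.
print(pac_sample_size(2 ** 100, 0.1, 0.05))
```

Note how m grows only logarithmically in |H|, which is why the bound stays usable even for very large finite hypothesis spaces.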
Slide 30: Review: Generalization error in finite hypothesis spaces [Haussler '88] (cont.)
- The same theorem as slide 28, with the punchline: even if h makes zero errors on the training data, it may make errors at test time

Slide 31: Limitations of the Haussler '88 bound
- Assumes a consistent classifier
- Depends on the size of the hypothesis space

Slide 32: What if our classifier does not have zero error on the training data?
- A learner with zero training error may still make mistakes on the test set
- What about a learner with training error error_train(h) > 0?

Slide 33: Simpler question: What's the expected error of a hypothesis? (see the sketch after slide 36)
- The error of a hypothesis is like estimating the parameter of a coin!
- Chernoff bound: for m i.i.d. coin flips x1, …, xm with xi ∈ {0, 1} and 0 < ε < 1:
  P(θ − (1/m) Σi xi > ε) ≤ e^(−2mε²)

Slide 34: Using the Chernoff bound to estimate the error of a single hypothesis

Slide 35: But we are comparing many hypotheses: union bound
- Apply the Chernoff bound to each hypothesis hi, then take the union bound
- What if I am comparing two hypotheses, h1 and h2?

Slide 36: Generalization
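To make slides 33–34 concrete: under the one-sided Chernoff/Hoeffding form P(θ − (1/m) Σi xi > ε) ≤ e^(−2mε²) (I am assuming this is the formula on the slide, since the preview does not show it), the bound inverts the same two ways as the PAC bound. A minimal sketch (my own code):

```python
import math

def chernoff_sample_size(eps, delta):
    # From exp(-2 * m * eps**2) <= delta:  m >= ln(1/delta) / (2 * eps**2)
    return math.ceil(math.log(1.0 / delta) / (2.0 * eps ** 2))

def chernoff_error(m, delta):
    # Given m i.i.d. samples (coin flips / test mistakes), the empirical
    # mean overshoots the true mean by more than this eps with
    # probability at most delta.
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Example: estimating one hypothesis' error to within eps = 0.05
# with failure probability delta = 0.05 needs about 600 samples.
print(chernoff_sample_size(0.05, 0.05))
```

For slide 35's point, comparing k hypotheses at once, the union bound replaces δ with δ/k inside the logarithm, so the cost of comparing many hypotheses is again only logarithmic in k.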

