UT Dallas CS 6375 - 16.svm-2


Slide 1: Machine Learning, CS 6375 (Spring 2015): Support Vector Machines (II)

Slide 2: The Dual Problem
Maximize
  L(\alpha) = \sum_{i=1}^{R} \alpha_i - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
subject to
  \alpha_i \ge 0 for all i, and \sum_{i=1}^{R} \alpha_i y_i = 0.
The weight vector is recovered as w = \sum_i \alpha_i y_i x_i, where \alpha_i \ne 0 only for the support vectors.

Slide 3: But what if the data is not linearly separable?
Idea 1: Find the minimum w \cdot w while also minimizing the number of training-set errors.
Problematic: two things to minimize makes for an ill-defined optimization.

Slide 4: But what if the data is not linearly separable?
Idea 1.1: Minimize w \cdot w + C (number of training errors).
There is a practical problem with this approach: it does not distinguish between disastrous errors and near misses.

Slide 5: But what if the data is not linearly separable?
Idea 1.2: Minimize w \cdot w + C (distance of error points to their correct place).

Slide 6: Learning the Maximum Margin Classifier
Given a guess of w and b we can
• compute the sum of distances of points to their correct zones, and
• compute the margin width.
Assume R data points, each (x_k, y_k) with y_k = +1 or -1.
What should our quadratic optimization criterion be? How many constraints will we have, and what should they be?

Slide 7: Learning the Soft Margin Classifier
Given a guess of w and b we can
• compute the sum of distances of points to their correct zones, and
• compute the margin width.
Assume R data points, each (x_k, y_k) with y_k = +1 or -1.
Quadratic optimization criterion: minimize \frac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k.
How many constraints will we have? 2R. What should they be?
  w \cdot x_k + b \ge 1 - \varepsilon_k   if y_k = +1
  w \cdot x_k + b \le -1 + \varepsilon_k  if y_k = -1
  \varepsilon_k \ge 0 for all k

Slide 8: Learning the Soft Margin Classifier (continued)
Our original (noiseless-data) QP had m + 1 variables: w_1, w_2, ..., w_m, and b.
Our new (noisy-data) QP has m + 1 + R variables: w_1, w_2, ..., w_m, b, \varepsilon_1, ..., \varepsilon_R.

Slide 9: Controlling Soft Margin Separation
C controls the trade-off between margin and error.

Slide 10: The Dual QP
Data points with \alpha_k > 0 are the support vectors, so this sum only needs to be taken over the support vectors.

Slide 11: Soft Margin Dual Problem
There is one factor \alpha_i for each training example:
  0 < \alpha_i < C: a support vector with \varepsilon_i = 0
  \alpha_i = C: a support vector with \varepsilon_i > 0
  \alpha_i = 0: otherwise (not a support vector)

Slide 12: SVM Training and Testing
Remember: only dot products of instances appear, in both training and testing (see the solver sketch after slide 19 below).
Training: maximize
  L(\alpha) = \sum_{i=1}^{R} \alpha_i - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
Testing: classify a new point x^* using
  w \cdot x^* + b = \sum_{x_i \in SV} \alpha_i y_i (x_i \cdot x^*) + b

Slide 13: Example: suppose we are in 1 dimension
What would SVMs do with this data? Not a big surprise.

Slide 14: Harder 1-dimensional Dataset
What can be done about this? Idea 1: use the soft margin and allow some errors.

Slides 15-16: Harder 1-dimensional Dataset
Let's project the data points to a 2-D space using non-linear basis functions. Now the data is linearly separable.

Slide 17: Learning in the Feature Space
Map the data into a feature space where it is linearly separable: x -> \Phi(x).

Slide 18: Non-Linear SVMs: Feature Spaces
General idea: the original feature space can be mapped to another (likely higher-dimensional) feature space where the training set is separable, via x -> \Phi(x).

Slide 19: QP with Basis Functions
  w = \sum_{k:\, \alpha_k > 0} \alpha_k y_k \Phi(x_k)   (sum over the support vectors)
  b = y_K (1 - \varepsilon_K) - w \cdot \Phi(x_K)   for a support vector K
Classify with f(x, w, b) = \mathrm{sign}(w \cdot \Phi(x) + b).
This is sensible. Is that the end of the story? No... there is one more trick!
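The soft-margin dual above (slides 2, 7, 11 and 12) maps directly onto a standard quadratic program. Below is a minimal sketch of that mapping in Python, assuming the generic cvxopt QP solver is available; the function name train_soft_margin_svm, the 1e-6 tolerance, and the bias-averaging step are illustrative choices, not from the slides.

```python
import numpy as np
from cvxopt import matrix, solvers  # generic QP solver, assumed available

def train_soft_margin_svm(X, y, C=1.0):
    """Sketch: solve the soft-margin SVM dual (slides 2, 11, 12) as a QP.

    maximize  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
    s.t.      0 <= alpha_i <= C,   sum_i alpha_i y_i = 0
    """
    R, m = X.shape
    K = X @ X.T                                         # Gram matrix of dot products x_i . x_j
    P = matrix((np.outer(y, y) * K).astype(float))      # quadratic term y_i y_j (x_i . x_j)
    q = matrix(-np.ones(R))                             # minimize -sum(alpha) == maximize sum(alpha)
    G = matrix(np.vstack([-np.eye(R), np.eye(R)]))      # box constraints 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(R), C * np.ones(R)]))
    A = matrix(y.reshape(1, -1).astype(float))          # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    # w = sum_i alpha_i y_i x_i, effectively summed only over support vectors (alpha_i > 0)
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    # bias from margin support vectors: 0 < alpha_i < C means eps_i = 0 and y_i (w.x_i + b) = 1
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
    bias = float(np.mean(y[on_margin] - X[on_margin] @ w))
    return w, bias, alpha
```

This linear version recovers w explicitly; with the basis-function version of slides 19-24 you would keep the alphas and support vectors instead and classify through the kernel, as in the sketch at the end of the notes.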
Slide 20: Quadratic Basis Functions
The quadratic basis expansion maps x to
  \Phi(x) = (1, \sqrt{2}\,x_1, ..., \sqrt{2}\,x_m, x_1^2, ..., x_m^2, \sqrt{2}\,x_1 x_2, ..., \sqrt{2}\,x_{m-1} x_m),
so the number of terms (assuming m input dimensions) is roughly m^2/2.
What are those \sqrt{2}'s doing?
• You should be happy that they do no harm.
• You will find out why they are there soon.

Slide 21: QP with Basis Functions
  w = \sum_{k:\, \alpha_k > 0} \alpha_k y_k \Phi(x_k),   b = y_K (1 - \varepsilon_K) - w \cdot \Phi(x_K) for a support vector K.
Classify with f(x, w, b) = \mathrm{sign}(w \cdot \Phi(x) + b).
We must do R^2/2 dot products to get this matrix ready, and each dot product requires m^2/2 additions and multiplications, so the whole thing costs about R^2 m^2 / 4. This is computationally expensive!

Slide 22: Quadratic Dot Products
Expanding the explicit feature map gives
  \Phi(a) \cdot \Phi(b) = 1 + 2\sum_i a_i b_i + \sum_i a_i^2 b_i^2 + 2\sum_{i<j} a_i a_j b_i b_j.

Slide 23: Quadratic Dot Products
Let's look at another function of a and b: (a \cdot b + 1)^2. They are the same!
What is the complexity of computing each? m versus m^2.

Slide 24: QP with Basis Functions
We still do R^2/2 dot products to get this matrix ready, but each dot product now only requires m additions and multiplications. (This is the kernel trick.)
  \Phi(x_k) \cdot \Phi(x_l) = K(x_k, x_l)
The matrix of these values is called the kernel (Gram) matrix.

Slide 25: Higher Order Polynomials
The same trick extends to higher-order polynomials: K(a, b) = (a \cdot b + 1)^d computes the dot product of the degree-d polynomial basis functions in O(m) time.

Slide 26: QP with Basis Functions
What about testing? Is it expensive in the new feature space?
Classify with f(x, w, b) = \mathrm{sign}(w \cdot \Phi(x) + b). Since w \cdot \Phi(x) = \sum_{k \in SV} \alpha_k y_k K(x_k, x), testing also needs only kernel evaluations. Example: a 5th-degree polynomial.

Slide 27: Overfitting? Curse of dimensionality?
• We map to a high-dimensional space x -> \Phi(x). Are we overfitting the training data?
• The use of the maximum margin "magically" makes this not a problem.
• No matter what the basis function is, there are only up to R parameters: \alpha_1, \alpha_2, ..., \alpha_R, and usually most of them are set to zero by the maximum-margin solution.

Slide 28: SVM Kernel Functions
• K(a, b) = (a \cdot b + 1)^d is an example of an SVM kernel function.
• Beyond polynomials there are other very-high-dimensional basis functions that can be made practical by finding the right kernel function:
  - radial-basis-style kernel functions, e.g. K(a, b) = \exp(-\|a - b\|^2 / (2\sigma^2));
  - neural-net-style kernel functions, e.g. K(a, b) = \tanh(\kappa \, a \cdot b - \delta);
  - string kernels, tree kernels, and so on.

Slide 29: Kernels
A kernel is a function that returns the value of the dot product between the images of its two arguments: K(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2).
Given a function K, it is possible to verify that it is a valid kernel, and new kernels can be constructed from simple ones.

Slide 30: SVM Performance
• Anecdotally, SVMs work very well indeed.
• There is a large body of empirical and theoretical studies.
• Ensemble ideas can be used for multi-class classification.

Slide 31: What you should know
• Linear SVMs.
• The definition of a maximum margin classifier.
• What QP can do for you (for this class, you do not need to know how it does it).
• How maximum margin can be turned into a QP problem.
• How we deal with noisy (non-separable) data.
• How we permit non-linear boundaries.
• How SVM kernel functions let us pretend we are working with an ultra-high-dimensional basis function.
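To make the kernel trick of slides 20-24 concrete, here is a small numpy sketch; the names quadratic_features, quadratic_kernel, and kernel_decision are mine, not from the slides. It checks that the explicit quadratic basis expansion with the \sqrt{2} factors gives exactly the same dot product as K(a, b) = (a \cdot b + 1)^2, and shows a kernelized classifier that never forms \Phi(x) explicitly.

```python
import numpy as np

def quadratic_features(x):
    """Explicit quadratic basis expansion Phi(x) from slide 20 (about m^2/2 terms).

    The sqrt(2) factors are exactly what makes Phi(a).Phi(b) equal (a.b + 1)^2.
    """
    m = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]                    # linear terms
    feats += [xi * xi for xi in x]                            # pure quadratic terms
    feats += [np.sqrt(2) * x[i] * x[j]                        # cross terms, i < j
              for i in range(m) for j in range(i + 1, m)]
    return np.array(feats)

def quadratic_kernel(a, b):
    """Same value as quadratic_features(a) . quadratic_features(b), but in O(m) instead of O(m^2)."""
    return (np.dot(a, b) + 1.0) ** 2

def kernel_decision(x, sv_X, sv_y, sv_alpha, bias, kernel=quadratic_kernel):
    """Classify with f(x) = sign(sum_{i in SV} alpha_i y_i K(x_i, x) + b); no explicit Phi is needed."""
    s = sum(a_i * y_i * kernel(x_i, x) for x_i, y_i, a_i in zip(sv_X, sv_y, sv_alpha))
    return np.sign(s + bias)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(quadratic_features(a) @ quadratic_features(b))  # 1089.0
print(quadratic_kernel(a, b))                         # (32 + 1)^2 = 1089.0
```

The two printed values agree, which is the "They are the same!" claim of slide 23: the kernel evaluates the high-dimensional dot product without ever building the feature vectors.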

