CMU CS 10701 - Lecture


Solving the optimal margin classifier

- Recall our optimization problem:
  $\max_{w,b}\ \frac{1}{\|w\|}$  s.t.  $y_i(w^T x_i + b) \ge 1,\ \forall i$
- This is equivalent to
  $\min_{w,b}\ \frac{1}{2} w^T w$  s.t.  $1 - y_i(w^T x_i + b) \le 0,\ \forall i$   (*)
- Write the Lagrangian:
  $\mathcal{L}(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^m \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
- Recall that (*) can be reformulated as  $\min_{w,b}\ \max_{\alpha_i \ge 0}\ \mathcal{L}(w, b, \alpha)$.
  Now we solve its dual problem:  $\max_{\alpha_i \ge 0}\ \min_{w,b}\ \mathcal{L}(w, b, \alpha)$.

The Dual Problem

- We minimize $\mathcal{L}$ with respect to $w$ and $b$ first:
  $\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0$,
  $\nabla_b \mathcal{L}(w, b, \alpha) = \sum_{i=1}^m \alpha_i y_i = 0$   (**)
- The first condition implies
  $w = \sum_{i=1}^m \alpha_i y_i x_i$   (***)
- Plugging (***) back into $\mathcal{L}$, and using (**), we have:
  $\mathcal{L}(w, b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, (x_i^T x_j)$

The Dual Problem, cont.

- Now we have the following dual optimization problem:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, (x_i^T x_j)$
  s.t.  $\alpha_i \ge 0,\ i = 1, \ldots, m$,  and  $\sum_{i=1}^m \alpha_i y_i = 0$
- This is, again, a quadratic programming problem, and a global maximum over the $\alpha_i$ can always be found.
- But what's the big deal? Note two things:
  1. $w$ can be recovered from $w = \sum_{i=1}^m \alpha_i y_i x_i$  (see next slide)
  2. The data enter only through the "kernel" $x_i^T x_j$  (more later)

Support vectors

- Note the KKT condition  $\alpha_i \, g_i(w) = 0,\ i = 1, \ldots, m$ --- only a few $\alpha_i$'s can be nonzero!
- [Figure: a two-class example in which only three multipliers are nonzero, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, and all other $\alpha_i = 0$.]
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).

Support vector machines

- Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $w$ as a weighted combination of the training examples:
  $w = \sum_{i \in SV} \alpha_i y_i x_i$
- For testing with a new datum $z$: compute
  $w^T z + b = \sum_{i \in SV} \alpha_i y_i \, (x_i^T z) + b$
  and classify $z$ as class 1 if the sum is positive, and class 2 otherwise.
- Note: $w$ need not be formed explicitly.

Interpretation of support vector machines

- The optimal $w$ is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of the kNN classifier.
- To compute the weights $\{\alpha_i\}$, and to use support vector machines, we need to specify only the inner products (or kernel) $x_i^T x_j$ between the examples.
- We make decisions by comparing each new example $z$ with only the support vectors:
  $y^* = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i y_i \, (x_i^T z) + b \right)$

Non-linearly separable problems

- We allow an "error" $\xi_i$ in classification; it is based on the output of the discriminant function $w^T x + b$.
- $\xi_i$ approximates the number of misclassified samples.
- [Figure: two overlapping classes (Class 1, Class 2) that cannot be separated without error.]

Soft margin hyperplane

- Now we have a slightly different optimization problem:
  $\min_{w,b}\ \frac{1}{2} w^T w + C \sum_{i=1}^m \xi_i$
  s.t.  $y_i(w^T x_i + b) \ge 1 - \xi_i,\ \forall i$  and  $\xi_i \ge 0,\ \forall i$
- The $\xi_i$ are "slack variables" in the optimization; $\xi_i = 0$ if there is no error for $x_i$.
- $\sum_i \xi_i$ gives an upper bound on the number of training errors.
- $C$ is a tradeoff parameter between error and margin.

The optimization problem

- The dual of this new constrained optimization problem is:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, (x_i^T x_j)$
  s.t.  $0 \le \alpha_i \le C,\ i = 1, \ldots, m$,  and  $\sum_{i=1}^m \alpha_i y_i = 0$
- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound $C$ on each $\alpha_i$.
- Once again, a QP solver can be used to find the $\alpha_i$.

Extension to a non-linear decision boundary

- So far we have only considered a large-margin classifier with a linear decision boundary. How do we generalize it to become nonlinear?
- Key idea: transform $x_i$ to a higher-dimensional space to "make life easier"
  - Input space: the space where the points $x_i$ live
  - Feature space: the space of the $\phi(x_i)$ after the transformation
- Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  - Classification can become easier with a proper transformation.
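
To make the formulas above concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available (neither appears in the lecture): it fits a linear soft-margin SVM on a toy two-class dataset, checks that only a few $\alpha_i$ are nonzero (the support vectors), recovers $w = \sum_{i \in SV} \alpha_i y_i x_i$ from the dual coefficients, and classifies a new point using only the support-vector expansion.

    # A minimal sketch (assuming NumPy and scikit-learn, which are not part of the
    # lecture): fit a linear soft-margin SVM on a toy two-class dataset and check
    # the formulas above: only a few alpha_i are nonzero (the support vectors),
    # w = sum_{i in SV} alpha_i y_i x_i, and the decision rule uses only the SVs.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two Gaussian blobs with labels in {-1, +1}.
    X = np.vstack([rng.normal(-2.0, 1.0, size=(20, 2)),
                   rng.normal(+2.0, 1.0, size=(20, 2))])
    y = np.array([-1] * 20 + [+1] * 20)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    sv = clf.support_                 # indices of the support vectors
    alpha_y = clf.dual_coef_.ravel()  # the products alpha_i * y_i for those points
    print("support vectors:", len(sv), "out of", len(X), "training points")

    # Recover w from the support-vector expansion and compare with the solver's w.
    w_from_alphas = (alpha_y[:, None] * X[sv]).sum(axis=0)
    print("w matches expansion:", np.allclose(w_from_alphas, clf.coef_.ravel()))

    # Classify a new point z using only the support vectors.
    z = np.array([1.5, 0.5])
    f_z = alpha_y @ (X[sv] @ z) + clf.intercept_[0]
    print("f(z) =", f_z, "-> class", 1 if f_z > 0 else -1)

Because the two blobs barely overlap, only a handful of the 40 points should end up as support vectors; the slack variables and the bound $C$ matter only for points near or across the margin.
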
Transforming the data

- Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional!
- The kernel trick comes to the rescue.
- [Figure: points in the input space mapped by $\phi(\cdot)$ into the feature space. Note: in practice the feature space is of higher dimension than the input space.]

The kernel trick

- Recall the SVM optimization problem:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, (x_i^T x_j)$
  s.t.  $0 \le \alpha_i \le C,\ i = 1, \ldots, m$,  and  $\sum_{i=1}^m \alpha_i y_i = 0$
- The data points only appear as inner products.
- As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
- Many common geometric operations (angles, distances) can be expressed by inner products.
- Define the kernel function $K$ by
  $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

An example of a feature mapping and its kernel

- Consider an input $x = [x_1, x_2]$.
- Suppose $\phi(\cdot)$ is given as follows:
  $\phi([x_1, x_2]) = \left[\, 1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2 \,\right]$
- An inner product in the feature space is
  $\phi([x_1, x_2])^T \phi([x_1', x_2']) = (1 + x^T x')^2$
- So, if we define the kernel function as
  $K(x, x') = (1 + x^T x')^2$,
  there is no need to carry out $\phi(\cdot)$ explicitly.

More examples of kernel functions

- Linear kernel (we've seen it):  $K(x, x') = x^T x'$
- Polynomial kernel (we just saw an example):  $K(x, x') = (1 + x^T x')^p$, where $p = 2, 3, \ldots$  To get the feature vectors, we concatenate all $p$-th order polynomial terms of the components of $x$ (weighted appropriately).
- Radial basis (Gaussian) kernel:  $K(x, x') = \exp\!\left( -\frac{1}{2} \|x - x'\|^2 \right)$.  In this case the feature space consists of functions, and the result is a non-parametric classifier.

Kernelized SVM

- Training:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$
  s.t.  $0 \le \alpha_i \le C,\ i = 1, \ldots, m$,  and  $\sum_{i=1}^m \alpha_i y_i = 0$
- Using (classifying a new point $z$):
  $y^* = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i y_i \, K(x_i, z) + b \right)$

SVM examples

- [Figures: decision boundaries of non-linear SVMs with a Gaussian kernel on several example datasets.]

Cross-validation error

- The leave-one-out cross-validation error does not depend on the dimensionality of the feature space, but only on the number of support vectors:
  Leave-one-out CV error $\le \dfrac{\#\ \text{support vectors}}{\#\ \text{training examples}}$

Machine Learning 10-701/15-781, Fall 2006
Boosting
Eric Xing
Lecture 9, October 10, 2006
Reading: Chap. 14.3, C.B. book

Rationale: combination of methods

- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different
  - algorithms,
  - hyperparameters,
  - representations (modalities),
  - training sets,
  - subproblems.
- The problem: how to combine them (see the sketch after these slides).

Some early algorithms

- Boosting by filtering (Schapire 1990)
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge on the performance of …
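
To make the "combine weak hypotheses into a strong one" idea above concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available (neither appears in the lecture); the choice of decision stumps as weak learners and of an unweighted majority vote is purely illustrative.

    # A minimal sketch (assuming NumPy and scikit-learn, which are not part of the
    # lecture): train several weak classifiers -- decision stumps, i.e. depth-1
    # trees -- on different random subsets of the data and combine them by a
    # plain, unweighted majority vote.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    # Noisy linear concept with labels in {-1, +1}.
    y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200))

    stumps = []
    for _ in range(15):                                    # 15 weak hypotheses (odd, so no ties)
        idx = rng.choice(len(X), size=100, replace=False)  # a different training set each time
        stumps.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

    # Majority vote: sum the {-1, +1} predictions and take the sign.
    combined = np.sign(np.sum([s.predict(X) for s in stumps], axis=0))

    print("best single stump accuracy:", max((s.predict(X) == y).mean() for s in stumps))
    print("majority-vote accuracy:   ", (combined == y).mean())

Boosting refines this simple recipe by replacing the random subsets and the unweighted vote with reweighted training sets and a weighted vote chosen according to the performance of the weak hypotheses.
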

