Multiclass Classification
9.520 Class 06, 25 Feb 2008
Ryan Rifkin

"It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."
Macbeth, Act V, Scene V

What Is Multiclass Classification?

Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs.

What Isn't Multiclass Classification?

There are many scenarios in which there are multiple categories to which points belong, but a given point can belong to multiple categories. In its most basic form, this problem decomposes trivially into a set of unlinked binary problems, which can be solved naturally using our techniques for binary classification.

A First Idea

Suppose we knew the density p_i(x) for each of the N classes. Then we would predict using

    f(x) = \arg\max_{i \in 1,\dots,N} p_i(x).

Of course we don't know the densities, but we could estimate them using classical techniques.

The Problem With Densities, and Motivation

Estimating densities is hard, especially in high dimensions with limited data. For binary classification tasks, we have seen that directly estimating a smooth separating function gives better results than density estimation (SVM, RLSC). Can we extend these approaches usefully to the multiclass scenario?

A Simple Idea — One-vs-All Classification

Pick a good technique for building binary classifiers (e.g., RLSC, SVM). Build N different binary classifiers. For the ith classifier, let the positive examples be all the points in class i, and let the negative examples be all the points not in class i. Let f_i be the ith classifier. Classify with

    f(x) = \arg\max_i f_i(x).

Another Simple Idea — All-vs-All Classification

Build N(N-1) classifiers, one classifier to distinguish each pair of classes i and j. Let f_{ij} be the classifier where class i gave the positive examples and class j gave the negative examples. Note that f_{ji} = -f_{ij}. Classify using

    f(x) = \arg\max_i \sum_j f_{ij}(x).

Also called all-pairs or one-vs-one classification.

The Truth

OVA and AVA are so simple that many people invented them independently. It's hard to write papers about them. So there's a whole cottage industry in fancy, sophisticated methods for multiclass classification.

To the best of my knowledge, choosing properly tuned regularization classifiers (RLSC, SVM) as your underlying binary classifiers and using one-vs-all (OVA) or all-vs-all (AVA) works as well as anything else you can do. If you actually have to solve a multiclass problem, I strongly urge you to simply use OVA or AVA, and not worry about anything else. The choice between OVA and AVA is largely computational.

OVA vs. AVA

Viewed naively, AVA seems faster and more memory efficient. It requires O(N^2) classifiers instead of O(N), but each classifier is (on average) much smaller. If the time to build a classifier is superlinear in the number of data points, AVA is a better choice. With SVMs, AVA's probably best. However, if you can solve one RLS problem over your entire data set using a matrix factorization, you get multiclass classification essentially for free (see the RLS lecture). So with RLS, OVA's a great choice.
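Both reductions are a few lines of code on top of any binary learner. Below is a minimal sketch, assuming scikit-learn's LinearSVC as the underlying binary classifier; the learner choice and the function names are illustrative, not part of the lecture. It implements exactly the two arg-max rules above, using f_{ji} = -f_{ij} for AVA.

```python
# Minimal one-vs-all (OVA) and all-vs-all (AVA) reductions.
# Assumption: scikit-learn is available; LinearSVC stands in for any
# binary classifier with a real-valued decision function.
import numpy as np
from itertools import combinations
from sklearn.base import clone
from sklearn.svm import LinearSVC


def ova_fit_predict(X_train, y_train, X_test, base=LinearSVC()):
    """Train one classifier per class; predict by f(x) = arg max_i f_i(x)."""
    classes = np.unique(y_train)
    scores = np.empty((len(X_test), len(classes)))
    for k, c in enumerate(classes):
        # Class c is positive, everything else is negative.
        clf = clone(base).fit(X_train, (y_train == c).astype(int))
        scores[:, k] = clf.decision_function(X_test)  # real-valued f_i(x)
    return classes[np.argmax(scores, axis=1)]


def ava_fit_predict(X_train, y_train, X_test, base=LinearSVC()):
    """Train one classifier per pair; predict by arg max_i sum_j f_ij(x)."""
    classes = np.unique(y_train)
    index = {c: k for k, c in enumerate(classes)}
    votes = np.zeros((len(X_test), len(classes)))
    for ci, cj in combinations(classes, 2):
        # Fit f_ij on the points of classes i and j only.
        mask = np.isin(y_train, [ci, cj])
        clf = clone(base).fit(X_train[mask], (y_train[mask] == ci).astype(int))
        f = clf.decision_function(X_test)
        votes[:, index[ci]] += f   # f_ij(x)
        votes[:, index[cj]] -= f   # f_ji(x) = -f_ij(x)
    return classes[np.argmax(votes, axis=1)]
```

scikit-learn ships these same reductions as OneVsRestClassifier and OneVsOneClassifier, so in practice you rarely need to write the loop yourself.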
Other Approaches

There have been two basic approaches to extending regularization ideas to multiclass classification:

- "Single Machine" approaches — try to solve a single optimization problem that trains many binary classifiers simultaneously.
- "Error Correcting Code" approaches — try to combine binary classifiers in a way that lets you exploit decorrelations and correct errors.

These approaches are not completely exclusive.

Weston and Watkins, Vapnik

The first "single machine" approach:

    \min_{f_1,\dots,f_N \in \mathcal{H},\; \xi \in \mathbb{R}^{\ell(N-1)}} \quad \sum_{i=1}^{N} \|f_i\|_K^2 + C \sum_{i=1}^{\ell} \sum_{j \neq y_i} \xi_{ij}

    subject to:  f_{y_i}(x_i) + b_{y_i} \geq f_j(x_i) + b_j + 2 - \xi_{ij},
                 \xi_{ij} \geq 0.

Key idea: suppose that point i is in class y_i. Then, for j ≠ y_i, we want (abusing our notation w.r.t. b)

    f_{y_i}(x_i) - f_j(x_i) \geq 2,

or we pay a linear penalty of ξ_{ij}.

WW Analysis I

This idea seems intuitively reasonable. Is it good? Weston and Watkins perform experiments. On 2 out of 5 datasets, they find that their approach performs substantially better than OVA, and about the same on the rest. However, they state that "to enable comparison, for each algorithm C = ∞ was chosen (the training data must be classified without error)," so they are performing ERM, not regularization (C = ∞ ⇔ λ = 0). A Gaussian kernel was used, with σ the same for each method (not necessarily a good idea), and no information about how this σ was chosen.

WW Analysis II

Under what circumstances would we expect this method to outperform an OVA approach? Tough to say. We'd need a situation where it is hard to actually separate the data, but where there exist meaningful subsets of the data where, even though we can't assign a positive value to the correct class, we can assign it a less negative value than to the other classes. Or, we need subsets where, even though we assign positive values to some incorrect class, we assign even more strongly positive values to the correct classes.

Challenge: come up with a set of examples that actually has this property. Double challenge: have it hold when the kernel is a properly-tuned Gaussian.

WW Analysis III

There is no evidence that this scheme is any more accurate than an OVA scheme. It is, however, much slower. Instead of N problems of size ℓ, we need to solve a problem of size (N-1)ℓ. Moreover, although the authors derive a dual problem, they don't show any way to decompose it; the resulting problem is much more complex than a standard SVM.

Lee, Lin and Wahba: Motivation

In an earlier paper, Lin showed that asymptotically, as we let ℓ → ∞ and λ → 0, the solution of an SVM will tend to

    f(x) = \mathrm{sign}\left(p(x) - \frac{1}{2}\right).

This is good for binary classification. Now consider multiclass classification with an OVA scheme. In regions where there is a dominant class i for which p_i(x) > 1/2, all is good. If there isn't, then all N of the OVA functions will return -1, and we will be unable to recover the most likely class.

Lee, Lin and Wahba: Notation

For i ∈ 1, . . . , N, define v_i as the N-dimensional vector with a 1 in the ith coordinate and -1/(N-1) elsewhere. The vector v_i serves as the "target" for points in class i — we try to get function outputs that are very close to the entries of v_i. Given a training point x_i, we try to make f_j(x_i) = -1/(N-1) for j ≠ y_i, and f_{y_i}(x_i) = 1.
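To make the two formulations concrete, here is a small NumPy sketch; the function names and the (ℓ × N) output-matrix layout are my own choices for illustration, not notation from either paper. The first function evaluates the total Weston-Watkins slack penalty for given classifier outputs; the second builds the Lee, Lin and Wahba target vectors.

```python
# Illustrative sketch of the two quantities defined above (names are mine).
import numpy as np


def ww_penalty(F, y):
    """Weston-Watkins penalty: sum over i and over j != y_i of
    xi_ij = max(0, 2 - (f_{y_i}(x_i) - f_j(x_i))).

    F : (l, N) array with F[i, j] = f_j(x_i), offsets b_j folded in.
    y : (l,) integer array of class indices y_i.
    """
    l = len(y)
    margins = F[np.arange(l), y][:, None] - F  # f_{y_i}(x_i) - f_j(x_i)
    slack = np.maximum(0.0, 2.0 - margins)     # linear penalty per (i, j)
    slack[np.arange(l), y] = 0.0               # the j = y_i term is excluded
    return slack.sum()


def llw_targets(y, N):
    """Lee-Lin-Wahba targets: v_{y_i} has a 1 in coordinate y_i
    and -1/(N-1) everywhere else."""
    V = np.full((len(y), N), -1.0 / (N - 1))
    V[np.arange(len(y)), y] = 1.0
    return V
```

For example, with N = 3 a point of class 0 gets the target row (1, -1/2, -1/2), and a point whose correct output beats every other class by a margin of at least 2 contributes nothing to ww_penalty.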

