MIT 9.520 - Multiclass Classification

Multiclass Classification
9.520 Class 08, 06 March 2006
Ryan Rifkin

"It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."
Macbeth, Act V, Scene V

What Is Multiclass Classification?

Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs.

What Isn't Multiclass Classification?

There are many scenarios in which there are multiple categories to which points belong, but a given point can belong to multiple categories. In its most basic form, this problem decomposes trivially into a set of unlinked binary problems, which can be solved naturally using our techniques for binary classification.

A First Idea

Suppose we knew the density p_i(x) for each of the N classes. Then we would predict using

    f(x) = \arg\max_{i \in 1,\dots,N} p_i(x).

Of course we don't know the densities, but we could estimate them using classical techniques.

The Problem With Densities, and Motivation

Estimating densities is hard, especially in high dimensions with limited data. For binary classification tasks, we have seen that directly estimating a smooth separating function gives better results than density estimation (SVM, RLSC). Can we extend these approaches usefully to the multiclass scenario?

A Simple Idea — One-vs-All Classification

Pick a good technique for building binary classifiers (e.g., RLSC, SVM). Build N different binary classifiers. For the i-th classifier, let the positive examples be all the points in class i, and let the negative examples be all the points not in class i. Let f_i be the i-th classifier. Classify with

    f(x) = \arg\max_i f_i(x).

Another Simple Idea — All-vs-All Classification

Build N(N − 1) classifiers, one classifier to distinguish each pair of classes i and j. Let f_{ij} be the classifier where class i points were positive examples and class j points were negative. Note that f_{ji} = −f_{ij}. Classify using

    f(x) = \arg\max_i \sum_j f_{ij}(x).

Also called all-pairs or one-vs-one classification.

The Truth

OVA and AVA are so simple that many people invented them independently. It's hard to write papers about them. So there's a whole cottage industry in fancy, sophisticated methods for multiclass classification.

To the best of my knowledge, choosing properly tuned regularization classifiers (RLSC, SVM) as your underlying binary classifiers and using one-vs-all (OVA) or all-vs-all (AVA) works as well as anything else you can do.

If you actually have to solve a multiclass problem, I strongly urge you to simply use OVA or AVA, and not worry about anything else. The choice between OVA and AVA is largely computational.

OVA vs. AVA

Viewed naively, AVA seems faster and more memory efficient. It requires O(N^2) classifiers instead of O(N), but each classifier is (on average) much smaller. If the time to build a classifier is superlinear in the number of data points, AVA is a better choice. With SVMs, AVA's probably best.

However, if you can solve one RLS problem over your entire data set using a matrix factorization, you get multiclass classification essentially for free (see RLS lecture). So with RLS, OVA's a great choice.
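To make the OVA recipe concrete, here is a minimal sketch in Python. It is my own illustration, not code from the lecture: the function names (rls_ova_train, rls_ova_predict), the Gaussian kernel choice, and the regularized system (K + λℓI)C = Y are assumptions based on the RLS setup. The point it demonstrates is the one above: a single factorization of K + λℓI solves all N one-vs-all problems at once, since they differ only in their right-hand sides.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gaussian_kernel(A, B, sigma=1.0):
        # Pairwise Gaussian (RBF) kernel between rows of A and rows of B.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    def rls_ova_train(X, y, lam=1e-2, sigma=1.0):
        """Train all N one-vs-all RLS classifiers at once.

        Solving (K + lam*l*I) C = Y handles all N right-hand sides with
        a single Cholesky factorization, so OVA multiclass costs little
        more than one binary RLS problem.
        """
        l = len(y)
        classes = np.unique(y)
        # Y[i, j] = +1 if point i is in class j, else -1 (OVA labels).
        Y = np.where(y[:, None] == classes[None, :], 1.0, -1.0)
        K = gaussian_kernel(X, X, sigma)
        C = cho_solve(cho_factor(K + lam * l * np.eye(l)), Y)
        return classes, C

    def rls_ova_predict(X_train, classes, C, X_test, sigma=1.0):
        F = gaussian_kernel(X_test, X_train, sigma) @ C  # column i holds f_i(x)
        return classes[F.argmax(axis=1)]                 # f(x) = argmax_i f_i(x)

An AVA variant would instead train one classifier per pair of classes and classify by summing the pairwise outputs, as described above; that route loses the shared-factorization trick, which is why AVA pays off mainly when the binary learner (e.g., an SVM) is superlinear in the number of points.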
Other Approaches

There have been two basic approaches to extending regularization ideas to multiclass classification:

• "Single Machine" approaches — try to solve a single optimization problem that trains many binary classifiers simultaneously.
• "Error Correcting Code" approaches — try to combine binary classifiers in a way that lets you exploit decorrelations and correct errors.

These approaches are not completely exclusive.

Weston and Watkins, Vapnik

The first "single machine" approach:

    \min_{f_1,\dots,f_N \in \mathcal{H},\ \xi \in \mathbb{R}^{\ell(N-1)}} \quad \sum_{i=1}^{N} \|f_i\|_K^2 + C \sum_{i=1}^{\ell} \sum_{j \neq y_i} \xi_{ij}

    subject to:  f_{y_i}(x_i) + b_{y_i} \ge f_j(x_i) + b_j + 2 - \xi_{ij}
                 \xi_{ij} \ge 0

Key idea: suppose that point i is in class y_i. Then, for j ≠ y_i, we want (abusing our notation w.r.t. b)

    f_{y_i}(x_i) - f_j(x_i) \ge 2,

or we pay a linear penalty of \xi_{ij}.

WW Analysis I

This idea seems intuitively reasonable. Is it good? Weston and Watkins perform experiments. On 2 out of 5 datasets, they find that their approach performs substantially better than OVA, and about the same on the rest.

However, they state that "to enable comparison, for each algorithm C = ∞ was chosen (the training data must be classified without error)," so they are performing ERM, not regularization (C = ∞ ⟺ λ = 0). A Gaussian kernel was used, with σ the same for each method (not necessarily a good idea), and no information about how this σ was chosen.

WW Analysis II

Under what circumstances would we expect this method to outperform an OVA approach? Tough to say. We'd need a situation where it would be hard to actually separate the data, but where there exist meaningful subsets of the data where, even though we can't assign a positive value to the correct class, we can assign a less negative value to it than to the other classes. Or, we need subsets where, even though we're assigning positive values to some incorrect classes, we're assigning even more strongly positive values to the correct classes.

Challenge: come up with a set of examples that actually has this property. Double challenge: have it hold when the kernel is a properly tuned Gaussian.

WW Analysis III

There is no evidence that this scheme is any more accurate than an OVA scheme. It is, however, much slower. Instead of N problems of size ℓ, we need to solve a problem of size (N − 1)ℓ. Moreover, although the authors derive a dual problem, they don't show any way to decompose it; the resulting problem is much more complex than a standard SVM.

Lee, Lin and Wahba: Motivation

In an earlier paper, Lin showed that asymptotically, as we let ℓ → ∞ and λ → 0, the solution of an SVM will tend to

    f(x) = \mathrm{sign}\left(p(x) - \frac{1}{2}\right).

This is good for binary classification.

Now consider multiclass classification with an OVA scheme. In regions where there is a dominant class i for which p_i(x) > 1/2, all is good. If there isn't, then all N of the OVA functions will return −1, and we will be unable to recover the most likely class.

Lee, Lin and Wahba: Notation

For i ∈ 1, ..., N, define v_i as the N-dimensional vector with a 1 in the i-th coordinate and −1/(N − 1) elsewhere. The vector v_i serves as the "target" for points in class i: we try to get function outputs that are very close to the entries of v_i.

Given a training point x_i, we try to make f_j(x_i) = −1/(N − 1) for j ≠ y_i, and we also require that \sum_{j=1}^{N} f_j(x_i) = 0. This leads us to...
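To make the notation concrete, here is a small numeric sketch. It is my own illustration rather than anything from the notes, and the helper name llw_targets is hypothetical. It builds the target vector v_{y_i} for each training point and confirms that every row sums to zero, matching the constraint \sum_{j=1}^{N} f_j(x_i) = 0.

    import numpy as np

    def llw_targets(y, classes):
        """Build the Lee-Lin-Wahba target matrix.

        Row i is v_{y_i}: a 1 in the coordinate of point i's class and
        -1/(N-1) in every other coordinate, so each row sums to zero.
        """
        N = len(classes)
        V = np.full((len(y), N), -1.0 / (N - 1))
        V[np.arange(len(y)), np.searchsorted(classes, y)] = 1.0
        return V

    classes = np.array([0, 1, 2])   # N = 3 classes
    y = np.array([2, 0, 1, 2])
    V = llw_targets(y, classes)
    print(V)                        # row for a class-2 point: [-0.5, -0.5, 1.0]
    print(V.sum(axis=1))            # each row sums to 0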

