Multiclass Classification
9.520 Class 06, 25 Feb 2008
Ryan Rifkin

"It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."
Macbeth, Act V, Scene V

What Is Multiclass Classification?

Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs.

What Isn't Multiclass Classification?

There are many scenarios in which there are multiple categories to which points belong, but a given point can belong to multiple categories. In its most basic form, this problem decomposes trivially into a set of unlinked binary problems, which can be solved naturally using our techniques for binary classification.

A First Idea

Suppose we knew the density p_i(x) for each of the N classes. Then we would predict using

    f(x) = \arg\max_{i \in 1,\dots,N} p_i(x).

Of course we don't know the densities, but we could estimate them using classical techniques.

The Problem With Densities, and Motivation

Estimating densities is hard, especially in high dimensions with limited data. For binary classification tasks, we have seen that directly estimating a smooth separating function gives better results than density estimation (SVM, RLSC). Can we extend these approaches usefully to the multiclass scenario?

A Simple Idea — One-vs-All Classification

Pick a good technique for building binary classifiers (e.g., RLSC, SVM). Build N different binary classifiers. For the ith classifier, let the positive examples be all the points in class i, and let the negative examples be all the points not in class i. Let f_i be the ith classifier. Classify with

    f(x) = \arg\max_i f_i(x).

Another Simple Idea — All-vs-All Classification

Build N(N-1) classifiers, one classifier to distinguish each pair of classes i and j. Let f_{ij} be the classifier where class i gave the positive examples and class j gave the negative examples. Note that f_{ji} = -f_{ij}. Classify using

    f(x) = \arg\max_i \sum_j f_{ij}(x).

Also called all-pairs or one-vs-one classification.

The Truth

OVA and AVA are so simple that many people invented them independently. It's hard to write papers about them. So there's a whole cottage industry in fancy, sophisticated methods for multiclass classification.

To the best of my knowledge, choosing properly tuned regularization classifiers (RLSC, SVM) as your underlying binary classifiers and using one-vs-all (OVA) or all-vs-all (AVA) works as well as anything else you can do. If you actually have to solve a multiclass problem, I strongly urge you to simply use OVA or AVA, and not worry about anything else. The choice between OVA and AVA is largely computational.

OVA vs. AVA

Viewed naively, AVA seems faster and more memory efficient. It requires O(N^2) classifiers instead of O(N), but each classifier is (on average) much smaller. If the time to build a classifier is superlinear in the number of data points, AVA is a better choice. With SVMs, AVA's probably best. However, if you can solve one RLS problem over your entire data set using a matrix factorization, you get multiclass classification essentially for free (see the RLS lecture). So with RLS, OVA's a great choice.
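Both reductions are a few lines of code on top of any binary learner. Below is a minimal sketch, assuming scikit-learn's LinearSVC as the underlying binary classifier; the learner choice and the function names are illustrative, not part of the lecture. It implements exactly the two arg-max rules above, using f_{ji} = -f_{ij} for AVA.

```python
# Minimal one-vs-all (OVA) and all-vs-all (AVA) reductions.
# Assumption: scikit-learn is available; LinearSVC stands in for any
# binary classifier with a real-valued decision function.
import numpy as np
from itertools import combinations
from sklearn.base import clone
from sklearn.svm import LinearSVC


def ova_fit_predict(X_train, y_train, X_test, base=LinearSVC()):
    """Train one classifier per class; predict by f(x) = arg max_i f_i(x)."""
    classes = np.unique(y_train)
    scores = np.empty((len(X_test), len(classes)))
    for k, c in enumerate(classes):
        # Class c is positive, everything else is negative.
        clf = clone(base).fit(X_train, (y_train == c).astype(int))
        scores[:, k] = clf.decision_function(X_test)  # real-valued f_i(x)
    return classes[np.argmax(scores, axis=1)]


def ava_fit_predict(X_train, y_train, X_test, base=LinearSVC()):
    """Train one classifier per pair; predict by arg max_i sum_j f_ij(x)."""
    classes = np.unique(y_train)
    index = {c: k for k, c in enumerate(classes)}
    votes = np.zeros((len(X_test), len(classes)))
    for ci, cj in combinations(classes, 2):
        # Fit f_ij on the points of classes i and j only.
        mask = np.isin(y_train, [ci, cj])
        clf = clone(base).fit(X_train[mask], (y_train[mask] == ci).astype(int))
        f = clf.decision_function(X_test)
        votes[:, index[ci]] += f   # f_ij(x)
        votes[:, index[cj]] -= f   # f_ji(x) = -f_ij(x)
    return classes[np.argmax(votes, axis=1)]
```

scikit-learn ships these same reductions as OneVsRestClassifier and OneVsOneClassifier, so in practice you rarely need to write the loop yourself.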
Other Approaches

There have been two basic approaches to extending regularization ideas to multiclass classification:

- "Single Machine" approaches — try to solve a single optimization problem that trains many binary classifiers simultaneously.
- "Error Correcting Code" approaches — try to combine binary classifiers in a way that lets you exploit decorrelations and correct errors.

These approaches are not completely exclusive.

Weston and Watkins, Vapnik

The first "single machine" approach:

    \min_{f_1,\dots,f_N \in \mathcal{H},\; \xi \in \mathbb{R}^{\ell(N-1)}} \quad \sum_{i=1}^{N} \|f_i\|_K^2 + C \sum_{i=1}^{\ell} \sum_{j \neq y_i} \xi_{ij}

    subject to:  f_{y_i}(x_i) + b_{y_i} \geq f_j(x_i) + b_j + 2 - \xi_{ij},
                 \xi_{ij} \geq 0.

Key idea: suppose that point i is in class y_i. Then, for j ≠ y_i, we want (abusing our notation w.r.t. b)

    f_{y_i}(x_i) - f_j(x_i) \geq 2,

or we pay a linear penalty of ξ_{ij}.

WW Analysis I

This idea seems intuitively reasonable. Is it good? Weston and Watkins perform experiments. On 2 out of 5 datasets, they find that their approach performs substantially better than OVA, and about the same on the rest. However, they state that "to enable comparison, for each algorithm C = ∞ was chosen (the training data must be classified without error)," so they are performing ERM, not regularization (C = ∞ ⇔ λ = 0). A Gaussian kernel was used, with σ the same for each method (not necessarily a good idea), and no information about how this σ was chosen.

WW Analysis II

Under what circumstances would we expect this method to outperform an OVA approach? Tough to say. We'd need a situation where it is hard to actually separate the data, but where there exist meaningful subsets of the data where, even though we can't assign a positive value to the correct class, we can assign it a less negative value than to the other classes. Or, we need subsets where, even though we assign positive values to some incorrect class, we assign even more strongly positive values to the correct classes.

Challenge: come up with a set of examples that actually has this property. Double challenge: have it hold when the kernel is a properly-tuned Gaussian.

WW Analysis III

There is no evidence that this scheme is any more accurate than an OVA scheme. It is, however, much slower. Instead of N problems of size ℓ, we need to solve a problem of size (N-1)ℓ. Moreover, although the authors derive a dual problem, they don't show any way to decompose it; the resulting problem is much more complex than a standard SVM.

Lee, Lin and Wahba: Motivation

In an earlier paper, Lin showed that asymptotically, as we let ℓ → ∞ and λ → 0, the solution of an SVM will tend to

    f(x) = \mathrm{sign}\left(p(x) - \frac{1}{2}\right).

This is good for binary classification. Now consider multiclass classification with an OVA scheme. In regions where there is a dominant class i for which p_i(x) > 1/2, all is good. If there isn't, then all N of the OVA functions will return -1, and we will be unable to recover the most likely class.

Lee, Lin and Wahba: Notation

For i ∈ 1, . . . , N, define v_i as the N-dimensional vector with a 1 in the ith coordinate and -1/(N-1) elsewhere. The vector v_i serves as the "target" for points in class i — we try to get function outputs that are very close to the entries of v_i. Given a training point x_i, we try to make f_j(x_i) = -1/(N-1) for j ≠ y_i, and f_{y_i}(x_i) = 1.
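To make the two formulations concrete, here is a small NumPy sketch; the function names and the (ℓ × N) output-matrix layout are my own choices for illustration, not notation from either paper. The first function evaluates the total Weston-Watkins slack penalty for given classifier outputs; the second builds the Lee, Lin and Wahba target vectors.

```python
# Illustrative sketch of the two quantities defined above (names are mine).
import numpy as np


def ww_penalty(F, y):
    """Weston-Watkins penalty: sum over i and over j != y_i of
    xi_ij = max(0, 2 - (f_{y_i}(x_i) - f_j(x_i))).

    F : (l, N) array with F[i, j] = f_j(x_i), offsets b_j folded in.
    y : (l,) integer array of class indices y_i.
    """
    l = len(y)
    margins = F[np.arange(l), y][:, None] - F  # f_{y_i}(x_i) - f_j(x_i)
    slack = np.maximum(0.0, 2.0 - margins)     # linear penalty per (i, j)
    slack[np.arange(l), y] = 0.0               # the j = y_i term is excluded
    return slack.sum()


def llw_targets(y, N):
    """Lee-Lin-Wahba targets: v_{y_i} has a 1 in coordinate y_i
    and -1/(N-1) everywhere else."""
    V = np.full((len(y), N), -1.0 / (N - 1))
    V[np.arange(len(y)), y] = 1.0
    return V
```

For example, with N = 3 a point of class 0 gets the target row (1, -1/2, -1/2), and a point whose correct output beats every other class by a margin of at least 2 contributes nothing to ww_penalty.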

