Survey on Multiclass Classification Methods

Mohamed Aly <[email protected]>

November 2005

Abstract

Supervised classification algorithms aim at producing a learning model from a labeled training set. Various successful techniques have been proposed to solve the problem in the binary classification case. The multiclass classification case is more delicate, as many of the algorithms were introduced basically to solve binary classification problems. In this short survey we investigate the various techniques for solving the multiclass classification problem.

1 Introduction

Supervised multiclass classification algorithms aim at assigning a class label to each input example. Given a training data set of the form (x_i, y_i), where x_i ∈ R^n is the ith example and y_i ∈ {1, ..., K} is the ith class label, we aim at finding a learning model H such that H(x_i) = y_i for new, unseen examples. The problem is simply formulated in the two-class case, where the labels y_i are just +1 or -1 for the two classes involved. Several algorithms have been proposed to solve this problem in the two-class case, some of which can be naturally extended to the multiclass case, and some of which need special formulations to be able to solve the latter case. The first category of algorithms includes decision trees [5, 16], neural networks [3], k-Nearest Neighbor [2], Naive Bayes classifiers [19], and Support Vector Machines [8]. The second category includes approaches for converting the multiclass classification problem into a set of binary classification problems that are efficiently solved using binary classifiers, e.g. Support Vector Machines [8, 6]. Another approach tries to impose a hierarchy on the output space (the available classes) and performs a series of tests to detect the class label of new patterns.

2 Extensible algorithms

For some algorithms, the multiclass classification problem can be solved by naturally extending the binary classification technique.
These include neural networks, decision trees, k-Nearest Neighbor, Naive Bayes, and Support Vector Machines.

Table 1: One-per-class coding
Class 1    1 0 0 0
Class 2    0 1 0 0
Class 3    0 0 1 0
Class 4    0 0 0 1

Table 2: Distributed coding
Class 1    0 0 0 0 0
Class 2    0 0 1 1 1
Class 3    1 1 0 0 1
Class 4    1 1 1 1 0

2.1 Neural Networks

Multilayer feedforward neural networks provide a natural extension to the multiclass problem. Instead of just having one neuron in the output layer, with binary output, we could have N binary neurons. The output codeword corresponding to each class can be chosen as follows [10]:

• One-per-class coding: each output neuron is designated the task of identifying a given class. The output code for that class should be 1 at this neuron, and 0 for the others. Therefore, we will need N = K neurons in the output layer, where K is the number of classes. When testing an unknown example, the neuron providing the maximum output determines the class label for that example. For instance, for a four-class problem, we will have output codes as shown in Table 1.

• Distributed output coding: each class is assigned a unique binary codeword from 0 to 2^N - 1, where N is the number of output neurons. When testing an unknown example, the output codeword is compared to the codewords for the K classes, and the nearest codeword, according to some distance measure, is considered the winning class. Usually the Hamming distance is used in this case, which is the number of bits in which the two codewords differ. For instance, for a four-class problem using N = 5 bit codewords, the coding can be as shown in Table 2. The Hamming distance between each pair of classes is at least 3, i.e. every two codes differ in at least three bits. If we get the codeword 11101 for an unknown example, we compute its distance from the four codewords shown above.
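The nearest-codeword decoding just described can be sketched as follows; the codeword table matches Table 2, but the helper names (`hamming`, `decode`) are illustrative and not from the paper:

```python
# Nearest-codeword decoding for distributed output coding (illustrative sketch).
CODEWORDS = {
    1: "00000",
    2: "00111",
    3: "11001",
    4: "11110",
}

def hamming(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(output):
    """Return the class whose codeword is nearest (in Hamming distance) to `output`."""
    return min(CODEWORDS, key=lambda c: hamming(output, CODEWORDS[c]))

# The network output 11101 differs from class 3's codeword 11001 in only one bit.
print(decode("11101"))  # -> 3
```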
The nearest codeword is that for class 3, with a distance of 1, so the class label assigned to that example will be class 3.

2.2 Decision Trees

Decision trees are a powerful classification technique. Two widely known algorithms for building decision trees are Classification and Regression Trees [5] and ID3/C4.5 [16]. The tree tries to infer a split of the training data based on the values of the available features to produce a good generalization. The split at each node is based on the feature that gives the maximum information gain. Each leaf node corresponds to a class label. A new example is classified by following a path from the root node to a leaf node, where at each node a test is performed on some feature of that example. The leaf node reached is considered the class label for that example. The algorithm can naturally handle binary or multiclass classification problems; the leaf nodes can refer to any of the K classes concerned.

2.3 k-Nearest Neighbors

kNN [2] is considered among the oldest non-parametric classification algorithms. To classify an unknown example, the distance (using some distance measure, e.g. Euclidean) from that example to every training example is measured. The k smallest distances are identified, and the most represented class among these k nearest neighbors is considered the output class label. The value of k is normally determined using a validation set or using cross-validation.

2.4 Naive Bayes

Naive Bayes [19] is a successful classifier based upon the principle of Maximum A Posteriori (MAP). Given a problem with K classes {C_1, ..., C_K} with so-called prior probabilities P(C_1), ..., P(C_K), we can assign the class label c to an unknown example with features x = (x_1, ..., x_N) such that c = argmax_c P(C = c | x_1, ..., x_N), that is, choose the class with the maximum a posteriori probability given the observed data. This a posteriori probability can be formulated, using Bayes' theorem, as follows:

P(C = c | x_1, ..., x_N) = P(C = c) P(x_1, ..., x_N | C = c) / P(x_1, ..., x_N).

As the denominator is the same for all classes, it can be dropped from the comparison. Now we should compute the so-called class-conditional probabilities of the features given the available classes. This can be quite difficult, taking into account the dependencies between features. The naive Bayes approach is to assume class-conditional independence, i.e. that x_1, ..., x_N are independent given the class. This simplifies the numerator to P(C = c) P(x_1 | C = c) ... P(x_N | C = c); we then choose the class c that maximizes this value over all the classes c = 1, ..., K. Clearly this approach is naturally extensible to the case of more than two classes, and it was shown to perform well in spite of the underlying simplifying assumption.
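The root-to-leaf classification walk described in section 2.2 can be sketched as below. The tiny hand-built tree and its splits are made up for illustration; real algorithms such as CART or C4.5 learn the splits from the training data:

```python
# A node is either a class label (leaf) or a tuple (feature_index, threshold, left, right).
# Hand-built tree over 2 features for a 3-class problem; the splits are illustrative only.
tree = (0, 0.5,            # internal node: test feature 0 against threshold 0.5
        (1, 0.5, 1, 2),    # feature 0 <= 0.5: test feature 1; leaves are classes 1 and 2
        3)                 # feature 0 >  0.5: leaf, class 3

def classify(node, x):
    """Follow the root-to-leaf path, testing one feature per internal node."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if x[feature] <= threshold else right
    return node  # the leaf reached is the class label

print(classify(tree, (0.2, 0.9)))  # feature 0 <= 0.5, feature 1 > 0.5 -> 2
```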
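The kNN procedure of section 2.3 (distance to every training example, then majority vote among the k nearest) can be sketched as below; the toy training set is made up for illustration:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """Classify `query` by majority vote among the k nearest training examples.
    `train` is a list of (feature_vector, label) pairs."""
    # Euclidean distance from the query to every training example, smallest first.
    dists = sorted((math.dist(x, query), y) for x, y in train)
    # Most represented class among the k nearest neighbors.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 1), ((0.1, 0.2), 1), ((1.0, 1.0), 2), ((0.9, 1.1), 2)]
print(knn_classify(train, (0.2, 0.1), k=3))  # -> 1
```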
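The naive Bayes decision rule (prior times the product of class-conditional probabilities, denominator dropped) can be sketched as below; the probability tables for two classes and two binary features are made up for illustration:

```python
from math import prod

# Made-up priors P(C=c) and class-conditional tables P(x_j = value | C = c).
priors = {1: 0.6, 2: 0.4}
cond = {
    1: [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}],  # class 1: P(x1 | C=1), P(x2 | C=1)
    2: [{0: 0.1, 1: 0.9}, {0: 0.6, 1: 0.4}],  # class 2: P(x1 | C=2), P(x2 | C=2)
}

def naive_bayes_classify(x):
    """Pick argmax_c P(C=c) * prod_j P(x_j | C=c); P(x) is the same for all c."""
    return max(
        priors,
        key=lambda c: priors[c] * prod(cond[c][j][v] for j, v in enumerate(x)),
    )

# Class 1 score: 0.6 * 0.2 * 0.3 = 0.036; class 2 score: 0.4 * 0.9 * 0.6 = 0.216.
print(naive_bayes_classify((1, 0)))  # -> 2
```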