Wright CS 707 - Support Vector Machine


NATURE BIOTECHNOLOGY VOLUME 24 NUMBER 12 DECEMBER 2006 1565

What is a support vector machine?

William S Noble

Support vector machines (SVMs) are becoming popular in a wide variety of biological applications. But, what exactly are SVMs and how do they work? And what are their most promising applications in the life sciences?

A support vector machine (SVM) is a computer algorithm that learns by example to assign labels to objects [1]. For instance, an SVM can learn to recognize fraudulent credit card activity by examining hundreds or thousands of fraudulent and nonfraudulent credit card activity reports. Alternatively, an SVM can learn to recognize handwritten digits by examining a large collection of scanned images of handwritten zeroes, ones and so forth. SVMs have also been successfully applied to an increasingly wide variety of biological applications. A common biomedical application of support vector machines is the automatic classification of microarray gene expression profiles. Theoretically, an SVM can examine the gene expression profile derived from a tumor sample or from peripheral fluid and arrive at a diagnosis or prognosis. Throughout this primer, I will use as a motivating example a seminal study of acute leukemia expression profiles [2]. Other biological applications of SVMs involve classifying objects as diverse as protein and DNA sequences, microarray expression profiles and mass spectra [3].

In essence, an SVM is a mathematical entity, an algorithm (or recipe) for maximizing a particular mathematical function with respect to a given collection of data. The basic ideas behind the SVM algorithm, however, can be explained without ever reading an equation.
Indeed, I claim that, to understand the essence of SVM classification, one needs only to grasp four basic concepts: (i) the separating hyperplane, (ii) the maximum-margin hyperplane, (iii) the soft margin and (iv) the kernel function.

Before describing an SVM, let’s return to the problem of classifying cancer gene expression profiles. The Affymetrix microarrays employed by Golub et al. [2] contained probes for 6,817 human genes. For a given bone marrow sample, the microarray assay returns 6,817 values, each of which represents the mRNA levels corresponding to a given gene. Golub et al. performed this assay on 38 bone marrow samples, 27 from individuals with acute lymphoblastic leukemia (ALL) and 11 from individuals with acute myeloid leukemia (AML). These data represent a good start to ‘train’ an SVM to tell the difference between ALL and AML expression profiles. If the learning is successful, then the SVM will be able to successfully diagnose a new patient as AML or ALL based upon its bone marrow expression profile.

For now, to allow an easy, geometric interpretation of the data, imagine that the microarrays contained probes for only two genes. In this case, our gene expression profiles consist of two numbers, which can be easily plotted (Fig. 1a). Based upon results from a previous study [4], I have selected the genes ZYX and MARCKSL1. In the figure, values are proportional to the intensity of the fluorescence on the microarray, so on either axis, a large value indicates that the gene is highly expressed and vice versa. The expression levels are indicated by a red or green dot, depending upon whether the sample is from a patient with ALL or AML. The SVM must learn to tell the difference between the two groups and, given an unlabeled expression vector, such as the one labeled ‘Unknown’ in the figure, predict whether it corresponds to a patient with ALL or AML.

The separating hyperplane

The human eye is very good at pattern recognition.
Even a quick glance at Figure 1a shows that the AML profiles form a cluster in the upper left region of the plot, and the ALL profiles cluster in the lower right. A simple rule might state that a patient has AML if the expression level of MARCKSL1 is twice as high as the expression level of ZYX, and vice versa for ALL. Geometrically, this rule corresponds to drawing a line between the two clusters (Fig. 1b). Subsequently, predicting the label of an unknown expression profile is easy: one simply needs to ask whether the new profile falls on the ALL or the AML side of this separating line.

Now, to define the notion of a separating hyperplane, consider a situation in which the microarray does not contain just two genes. For example, if the microarray contains a single gene, then the ‘space’ in which the corresponding one-dimensional expression profiles reside is a one-dimensional line. We can divide this line in half by using a single point (Fig. 1c). In two dimensions (Fig. 1b), a straight line divides the space in half, and in three dimensions, we need a plane to divide the space (Fig. 1d). We can extrapolate this procedure mathematically to higher dimensions. The general term for a straight line in a high-dimensional space is a hyperplane, and so the separating hyperplane is, essentially, the line that separates the ALL and AML samples.

The maximum-margin hyperplane

The concept of treating the objects to be classified as points in a high-dimensional space and finding a line that separates them is not unique to the SVM. The SVM, however, is different from other hyperplane-based classifiers by virtue of how the hyperplane is selected. Consider again the classification problem portrayed in Figure 1a. We have now established that the goal of the SVM is to identify a line that separates the ALL from the AML expression profiles in this two-dimensional space. However, many such lines exist (Fig. 1e).
Which one provides the best classifier?

With some thought, one may come up with the simple idea of selecting the line that is, roughly speaking, ‘in the middle’. In other words, one would select the line that separates the two classes but adopts the maximal distance from any one of the given expression profiles (Fig. 1f). It turns out that a theorem from the field of statistical learning theory supports exactly this choice [5]. If we define the distance from the separating hyperplane to the nearest expression vector as the margin of

William S. Noble is in the Departments of Genome Sciences and of Computer Science and Engineering, University of Washington, 1705 NE Pacific Street, Seattle, Washington 98195, USA. e-mail: [email protected]

© 2006 Nature Publishing Group http://www.nature.com/naturebiotechnology

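The two geometric ideas described in the text, a separating line and the margin of that line, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values are made up (they are not the Golub et al. measurements), the function names are hypothetical, and picking the widest-margin line from three hand-chosen candidates is a stand-in for SVM training, which searches over all possible lines by optimization.

```python
import math

# Hypothetical toy profiles as (ZYX, MARCKSL1) pairs. These numbers are made up
# for illustration; they are not the Golub et al. measurements.
AML = [(2.0, 9.0), (3.0, 10.0), (1.0, 8.0)]
ALL = [(8.0, 3.0), (9.0, 2.0), (7.0, 4.0)]


def score(w, b, x):
    """Signed value w.x + b; its sign says which side of the line x lies on."""
    return w[0] * x[0] + w[1] * x[1] + b


def predict(w, b, x):
    """Label a profile by the side of the line it falls on (positive = AML)."""
    return "AML" if score(w, b, x) > 0 else "ALL"


def separates(w, b):
    """True if every training profile lands on the side matching its label."""
    return all(predict(w, b, x) == "AML" for x in AML) and all(
        predict(w, b, x) == "ALL" for x in ALL
    )


def margin(w, b):
    """Distance from the line to the nearest training profile."""
    norm = math.hypot(w[0], w[1])
    return min(abs(score(w, b, x)) / norm for x in AML + ALL)


# Three hand-picked candidate lines, each written as (w, b).
# The first encodes the simple rule from the text: AML if MARCKSL1 > 2 * ZYX.
candidates = {
    "MARCKSL1 = 2 * ZYX": ((-2.0, 1.0), 0.0),
    "MARCKSL1 = ZYX": ((-1.0, 1.0), 0.0),
    "MARCKSL1 = 6": ((0.0, 1.0), -6.0),
}

# Keep only the lines that separate the two classes, then take the one
# whose nearest training profile is farthest away.
separating = {name: wb for name, wb in candidates.items() if separates(*wb)}
best = max(separating, key=lambda name: margin(*separating[name]))

# Classify an unlabeled profile by asking which side of the chosen line it is on.
unknown = (3.0, 8.0)
label = predict(*separating[best], unknown)
print(best, round(margin(*separating[best]), 3), label)
```

All three candidate lines separate the toy data, which mirrors the observation that many separating lines exist. Among these three, the diagonal line MARCKSL1 = ZYX leaves the largest distance to its nearest profile, so it is the one this sketch selects.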
