Support Vector Machines
Prasad, L15SVM
Adapted from lectures by Raymond Mooney (UT Austin) and Andrew Moore (CMU)

Text classification
- Earlier: algorithms for text classification.
  - K Nearest Neighbor (kNN) classification: simple, expensive at test time, low bias, high variance, non-linear.
  - Vector space classification using centroids and the hyperplanes that split them: a simple, linear classifier; perhaps too simple; high bias, low variance.
- Today: SVMs, some empirical evaluation and comparison, and text-specific issues in classification.

Linear classifiers: Which hyperplane?
- In two dimensions the decision boundary is the line ax + by - c = 0, and there are lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one (according to some criterion of expected goodness).
- The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
- Intuition: then there are fewer uncertain classification decisions.

Another intuition
- If we place a fat separator between the classes, we have fewer choices, and so the capacity of the model has been decreased.
- Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Decision hyperplane
- Recall that a decision hyperplane can be defined by:
  - the intercept term b, and
  - the normal vector w (the weight vector), which is perpendicular to the hyperplane.
- All points x on the hyperplane satisfy wTx + b = 0.

Support Vector Machine (SVM)
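The decision rule can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the values of w, b, and the test points are made up. Note that in the 2-D boundary ax + by - c = 0, the normal vector is w = (a, b) and the intercept term is -c (the slide's "b" in "by" is a coefficient, not the SVM intercept).

```python
import numpy as np

# Made-up example values: normal vector w and intercept b define the
# hyperplane wTx + b = 0 (in 2-D: 1*x + 2*y - 4 = 0).
w = np.array([1.0, 2.0])
b = -4.0

def classify(x, w, b):
    """Return +1 or -1 depending on which side of wTx + b = 0 the point lies."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([3.0, 3.0]), w, b))  # 1*3 + 2*3 - 4 = 5  -> +1
print(classify(np.array([0.0, 1.0]), w, b))  # 1*0 + 2*1 - 4 = -2 -> -1
```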
- SVMs maximize the margin around the separating hyperplane; hence they are also known as large margin classifiers.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving an SVM is a quadratic programming problem.
- Seen by many as the most successful current text classification method.

Robustness of SVMs
- If we make a small error in the location of the boundary, this approach minimizes misclassification.
- Recall that the support vectors are the datapoints that the margin pushes up against.
- The model is immune to removal of any non-support-vector datapoints.
- By contrast, Rocchio centroids depend on all points of a class, and kNN boundaries depend on local neighborhoods.
- Empirically, SVMs work well.

Maximum Margin: Formalization
- w: decision hyperplane normal
- xi: data point i
- yi: class of data point i (+1 or -1; NB: not 1/0)
- The classifier is f(xi) = sign(wTxi + b).
- The functional margin of xi is yi(wTxi + b).
- But note that we can increase this margin simply by scaling w and b.
- The functional margin of a dataset is the minimum functional margin over its points.
- The factor of 2 in the margin formula comes from measuring the whole width of the margin.
- The planar decision surface in data space for the simple linear discriminant function is wTx + w0 = 0.

Geometric Margin
- The distance from an example x to the separator is r = y(wTx + b) / ||w||.
- The examples closest to the hyperplane are the support vectors.
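A short sketch can make the distinction concrete: the functional margin grows when w and b are rescaled, while the geometric margin r does not. The toy dataset, w, and b below are made-up values for illustration only.

```python
import numpy as np

# Made-up toy data: two separable classes in 2-D, plus an assumed
# separating hyperplane wTx + b = 0 (none of these values are from the slides).
X = np.array([[2., 2.], [3., 3.], [-1., -1.], [-2., -2.]])
y = np.array([1, 1, -1, -1])
w = np.array([1., 1.])
b = 0.0

def functional_margin(X, y, w, b):
    """Minimum of yi (wTxi + b) over the dataset."""
    return np.min(y * (X @ w + b))

def geometric_margin(X, y, w, b):
    """Minimum of r = yi (wTxi + b) / ||w|| over the dataset."""
    return functional_margin(X, y, w, b) / np.linalg.norm(w)

print(functional_margin(X, y, w, b))            # 2.0
print(functional_margin(X, y, 10 * w, 10 * b))  # 20.0 -- grows under rescaling
print(geometric_margin(X, y, w, b))             # ~1.414
print(geometric_margin(X, y, 10 * w, 10 * b))   # ~1.414 -- unchanged under rescaling
```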
- The margin ρ of the separator is the width of separation between the support vectors of the two classes.

Linear SVM Mathematically
- Assume that all data are at least distance 1 from the hyperplane; then the following two constraints hold for a training set {(xi, yi)}:
  wTxi + b ≥ 1 if yi = 1
  wTxi + b ≤ -1 if yi = -1
- For the support vectors, the inequality becomes an equality.
- Then, since each example's distance from the hyperplane is r = y(wTx + b) / ||w||, the geometric margin is ρ = 2/||w||.

Linear Support Vector Machine (SVM)
- Hyperplane: wTx + b = 0
- Extra scale constraint: min over i = 1, …, n of |wTxi + b| = 1.
- For support vectors xa and xb on opposite sides of the margin (wTxa + b = 1 and wTxb + b = -1), this implies wT(xa - xb) = 2, so ρ = ||xa - xb||2 = 2/||w||2 (L2 norms).

Linear SVMs Mathematically (cont.)
- Then we can formulate the quadratic optimization problem:
  Find w and b such that ρ = 2/||w|| is maximized and, for all (xi, yi):
  wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ -1 if yi = -1.
- A better formulation (min ||w|| = max 1/||w||):
  Find w and b such that Φ(w) = ½ wTw is minimized and, for all (xi, yi): yi(wTxi + b) ≥ 1.

Recapitulation
- We start with a training data set.
- We feed the data through a quadratic optimization procedure to find the best separating hyperplane.
- Given a new point to classify, the classification function computes the projection of the point onto the hyperplane normal; the sign of this function determines the class.
- If the point is within the margin of the classifier, the classifier can return "don't know".
- The value of the function may also be transformed into a probability of classification.

Multiclass support vector machines
- SVMs are inherently two-class classifiers.
- The most common technique in practice: build |C| one-versus-rest classifiers (commonly referred to as
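The one-versus-rest scheme can be sketched as follows. This is an illustrative stand-in, not the lecture's exact method: instead of solving the hard-margin QP, each binary classifier is trained by simple sub-gradient descent on the closely related soft-margin (hinge-loss) objective, and all data, parameter values, and function names are made up.

```python
import numpy as np

def train_binary_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Sub-gradient descent on (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - yi wTxi).

    y must be in {+1, -1}; the bias is folded in as a constant-1 feature.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        viol = y * (Xb @ w) < 1                 # margin violators
        grad = lam * w - (y[viol, None] * Xb[viol]).sum(axis=0) / len(y)
        w -= lr * grad
    return w

def predict(models, x):
    """Assign x to the class whose one-vs-rest classifier scores highest."""
    xb = np.append(x, 1.0)
    return max(models, key=lambda c: models[c] @ xb)

# Made-up toy data: three well-separated classes in 2-D.
X = np.array([[0., 2.], [0., 3.], [2., 0.], [3., 0.], [-2., 0.], [-3., 0.]])
labels = np.array([0, 0, 1, 1, 2, 2])

# One binary "class c vs. the rest" classifier per class c.
models = {c: train_binary_svm(X, np.where(labels == c, 1, -1)) for c in (0, 1, 2)}
print(predict(models, np.array([0., 2.5])))     # expected: 0
```

At prediction time the |C| real-valued scores wTx + b are compared directly, which is why one-vs-rest needs classifiers that output calibrated-enough scores rather than bare labels.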