Wright CS 707 - Support Vector Machines


Deck outline:
- Support Vector Machines
- Text classification
- Linear classifiers: Which Hyperplane?
- Another intuition
- Support Vector Machine (SVM)
- Robustness of SVMs
- Maximum Margin: Formalization
- Geometric Margin
- Linear SVM Mathematically
- Linear Support Vector Machine (SVM)
- Linear SVMs Mathematically (cont.)
- Non-linear SVMs
- Nonlinear SVMs: The Clever Bit!
- Non-linear SVMs: Feature spaces (cont'd)
- Mathematical Details: SKIP
- Solving the Optimization Problem
- The Optimization Problem Solution
- Soft Margin Classification
- Soft Margin Classification Mathematically
- Soft Margin Classification – Solution
- Classification with SVMs
- Linear SVMs: Summary
- The "Kernel Trick"
- Kernels
- Evaluation: Classic Reuters Data Set
- Reuters Text Categorization data set (Reuters-21578) document
- New Reuters: RCV1: 810,000 docs
- Per-class evaluation measures
- Dumais et al. 1998: Reuters - Accuracy
- Results for Kernels (Joachims 1998)
- Micro- vs. Macro-Averaging
- Micro- vs. Macro-Averaging: Example
- Reuters ROC - Category Grain
- ROC for Category - Crude
- ROC for Category - Ship
- Yang & Liu: SVM vs. Other Methods
- Good practice department: Confusion matrix
- The Real World
- Manually written rules
- Very little data?
- A reasonable amount of data?
- A huge amount of data?
- How many categories?
- How can one tweak performance?
- Does putting in "hacks" help?
- Two techniques for zones
- Text Summarization techniques in text classification
- Does stemming/lowercasing/… help?
- Measuring Classification Figures of Merit
- A common problem: Concept Drift
- Summary

Support Vector Machines
Adapted from lectures by Raymond Mooney (UT Austin) and Andrew Moore (CMU)

Text classification
- Earlier: algorithms for text classification
  - K Nearest Neighbor classification: simple, expensive at test time, low bias, high variance, non-linear
  - Vector space classification using centroids and hyperplanes that split them: simple, linear classifier; perhaps too simple; high bias, low variance
- Today:
  - SVMs
  - Some empirical evaluation and comparison
  - Text-specific issues in classification

Linear classifiers: Which Hyperplane?
- Lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness].
- Support Vector Machine (SVM) finds an optimal solution.
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - Intuition: then there are fewer uncertain classification decisions.
- [Figure: the line ax + by - c = 0 represents the decision boundary.]

Another intuition
- If we place a fat separator between classes, we have fewer choices, and so the capacity of the model has been decreased.

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Decision hyperplane
- Recall that a decision hyperplane can be defined by:
  - the intercept term b
  - the normal vector w (the weight vector), which is perpendicular to the hyperplane
- All points x on the hyperplane satisfy: wTx + b = 0
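To make the hyperplane notation concrete, here is a minimal NumPy sketch (not from the lecture): the vectors w and b and the four 2-D points are made-up illustrations. It evaluates the decision rule sign(wTx + b) and the signed distance of each point from the hyperplane; the same quantities reappear below as the functional and geometric margins.

```python
import numpy as np

# Illustrative, made-up hyperplane: normal vector w and intercept term b.
w = np.array([2.0, 1.0])
b = -1.0

# Made-up 2-D data points with labels in {+1, -1}.
X = np.array([[2.0, 3.0],
              [1.0, 1.0],
              [-1.0, 0.5],
              [0.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Decision rule: classify each point by the sign of w^T x + b.
scores = X @ w + b
predictions = np.sign(scores)

# The signed distance of a point from the hyperplane w^T x + b = 0 is
# (w^T x + b) / ||w||; multiplying by the true label y gives the
# geometric margin of that example.
distances = scores / np.linalg.norm(w)
geometric_margins = y * distances

print(predictions)        # [ 1.  1. -1. -1.]
print(geometric_margins)  # all positive: every point lies on its correct side
```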
Support Vector Machine (SVM)
- SVMs maximize the margin around the separating hyperplane (a.k.a. large margin classifiers).
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving SVMs is a quadratic programming problem.
- Seen by many as the most successful current text classification method.
- [Figure: separating hyperplane with support vectors on the margin; maximize margin.]

Robustness of SVMs
- If we make a small error in the location of the boundary, this approach minimizes misclassification.
- Recall that support vectors are datapoints that the margin pushes up against.
- The model is immune to removal of any non-support-vector datapoints.
  - Rocchio centroids depend on all points of a class.
  - kNN boundaries depend on local neighborhoods.
- Empirically, SVMs work well.

Maximum Margin: Formalization
- w: decision hyperplane normal
- xi: data point i
- yi: class of data point i (+1 or -1); NB: not 1/0
- Classifier is: f(xi) = sign(wTxi + b)
- Functional margin of xi is: yi(wTxi + b)
  - But note that we can increase this margin simply by scaling w, b…
- Functional margin of the dataset is the minimum functional margin over all points.
- The factor of 2 comes from measuring the whole width of the margin.

The planar decision surface in data-space for the simple linear discriminant function: wTx + w0 = 0.

Geometric Margin
- Distance from an example to the separator is r = y(wTx + b) / ||w||.
- Examples closest to the hyperplane are support vectors.
- Margin ρ of the separator is the width of separation between the support vectors of the classes.
- [Figure: points x and x′ at distance r from the hyperplane; margin ρ between the support vectors.]

Linear SVM Mathematically
- Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set {(xi, yi)}:
  wTxi + b ≥ 1   if yi = 1
  wTxi + b ≤ -1  if yi = -1
- For support vectors, the inequality becomes an equality.
- Then, since each example's distance from the hyperplane is r = y(wTx + b) / ||w||, the geometric margin is ρ = 2/||w||.

Linear Support Vector Machine (SVM)
- Hyperplane: wTx + b = 0
- Extra scale constraint: min over i = 1,…,n of |wTxi + b| = 1
- With support vectors xa and xb on opposite margin boundaries (wTxa + b = 1, wTxb + b = -1), this implies wT(xa - xb) = 2, so ρ = ||xa - xb||2 = 2/||w||2.

Linear SVMs Mathematically (cont.)
- Then we can formulate the quadratic optimization problem:
  Find w and b such that ρ = 2/||w|| is maximized, and for all {(xi, yi)}: wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ -1 if yi = -1.
- A better formulation (min ||w|| = max 1/||w||):
  Find w and b such that Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}: yi(wTxi + b) ≥ 1.

Recapitulation
- We start with a training data set.
- We feed the data through a quadratic optimization procedure to find the best separating hyperplane.
- Given a new point x to classify, the classification function f(x) computes the projection of the point onto the hyperplane normal.
- The sign of this function determines the class.
- If the point is within the margin of the classifier, the classifier can return "don't know".
- The value of f(x) may also be transformed into a probability of classification.

Multiclass support vector machines
- SVMs are inherently two-class classifiers.
- Most common technique in practice: build |C| one-versus-rest classifiers (commonly referred to as one-versus-all).
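As a sketch of how the one-versus-rest setup is typically applied to text (this is not code from the lecture; the corpus, labels, and the soft-margin parameter C=1.0 are assumptions), scikit-learn's LinearSVC trains one linear large-margin classifier per class:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus; the lecture's actual evaluations use Reuters data.
train_docs = [
    "wheat corn harvest boosts grain exports",
    "grain shipments rise as wheat prices fall",
    "crude oil prices jump on supply fears",
    "opec agrees to cut crude output",
    "oil tanker collides in the gulf, shipping disrupted",
    "port closed to all ship traffic after storm",
]
train_labels = ["grain", "grain", "crude", "crude", "ship", "ship"]

# TF-IDF features followed by a linear SVM.  With |C| = 3 classes,
# LinearSVC fits three one-versus-rest hyperplanes w_c^T x + b_c = 0.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(train_docs, train_labels)

# One weight vector per class confirms the one-versus-rest setup.
print(model.named_steps["linearsvc"].coef_.shape)   # (3, vocabulary size)

print(model.predict(["wheat futures and grain prices",
                     "crude supply and oil output"]))
```

The deck's own evaluations (Reuters-21578, RCV1) are of course far larger; this only illustrates the shape of the pipeline.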

