
SVMs with Slack
Nicholas Ruozzi
University of Texas at Dallas

(Based roughly on the slides of David Sontag)

Dual SVM

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \qquad \text{such that}\ \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0 \text{ for all } i$$

The dual formulation depends on the data only through inner products between the data points. The same is true if we use feature vectors instead: replace $x_i^\top x_j$ with $\phi(x_i)^\top \phi(x_j)$.

The Kernel Trick

More generally, a kernel is a function $k(x, x') = \phi(x)^\top \phi(x')$ for some feature map $\phi$. Rewrite the dual objective:

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, k(x_i, x_j)$$

Kernels

A bigger feature space increases the possibility of overfitting, though large-margin solutions may still generalize reasonably well. An alternative is to add penalties to the objective to disincentivize complicated solutions, e.g., penalizing each misclassification:

$$\min_{w, b}\ \frac{1}{2}\lVert w\rVert^2 + c \cdot (\#\text{ of misclassifications})$$

This is not a quadratic program anymore; in fact, it is NP-hard. It also has a problem similar to counting the number of misclassifications directly: there is no notion of how badly the data is misclassified.

SVMs with Slack

Allow misclassification, but penalize it linearly, just like in the perceptron algorithm. This is easier to work with than counting misclassifications, the objective stays convex, and it will let us handle data that isn't linearly separable:

$$\min_{w, b, \xi}\ \frac{1}{2}\lVert w\rVert^2 + c\sum_i \xi_i \qquad \text{such that}\ y_i(w^\top x_i + b) \ge 1 - \xi_i \text{ for all } i,\quad \xi_i \ge 0 \text{ for all } i$$

This potentially allows some points to be misclassified or to fall inside the margin. The constant $c$ determines the degree to which slack is penalized.

How does this objective change with $c$? As $c \to \infty$, it requires a perfect classifier; as $c \to 0$, it allows arbitrary classifiers (i.e., it ignores the data).

How should we pick $c$? Divide the data into three pieces: training, testing, and validation. Use the validation set to tune the value of the hyperparameter $c$.

Evaluation Methodology

General learning strategy:
- Build a classifier using the training data
- Select hyperparameters using the validation data
- Evaluate the chosen model, with the selected hyperparameters, on the test data
How can we tell if we overfit the training data?

ML in Practice

- Gather data and labels; select feature vectors
- Randomly split into three groups: training set, validation set, test set
- Experimentation cycle: select a good hypothesis from the hypothesis space, tune hyperparameters using the validation set, and compute accuracy on the test set (the fraction of correctly classified instances)

Optimal Slack Values

What is the optimal value of $\xi_i$ for fixed $w$ and $b$?
- If $y_i(w^\top x_i + b) \ge 1$, then $\xi_i = 0$
- If $y_i(w^\top x_i + b) < 1$, then $\xi_i = 1 - y_i(w^\top x_i + b)$
We can formulate this slightly differently:

$$\xi_i = \max\left(0,\ 1 - y_i(w^\top x_i + b)\right)$$

Does this look familiar?

Hinge Loss Formulation

Obtain a new objective by substituting in for $\xi_i$:

$$\min_{w, b}\ c\sum_i \underbrace{\max\left(0,\ 1 - y_i(w^\top x_i + b)\right)}_{\text{hinge loss}} + \underbrace{\frac{1}{2}\lVert w\rVert^2}_{\text{penalty to prevent overfitting}}$$

This objective can be minimized with (sub)gradient descent.

Imbalanced Data

If the data is imbalanced, i.e., there are more positive examples than negative examples (or vice versa), we may want to distribute the error evenly between the two classes:

$$\min_{w, b, \xi}\ \frac{1}{2}\lVert w\rVert^2 + \frac{c}{|N_+|}\sum_{i \in N_+} \xi_i + \frac{c}{|N_-|}\sum_{i \in N_-} \xi_i \qquad \text{such that}\ y_i(w^\top x_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i$$

where $N_+ = \{i : y_i = +1\}$ and $N_- = \{i : y_i = -1\}$.

Dual of Slack Formulation

The slack formulation is convex in $w$, $b$, and $\xi$, so take derivatives of the Lagrangian to form the dual:

$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad c - \alpha_i - \mu_i = 0$$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the multipliers for the margin and nonnegativity constraints. Substituting back yields

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \qquad \text{such that}\ \sum_i \alpha_i y_i = 0,\quad 0 \le \alpha_i \le c \text{ for all } i$$

Generalization

We argued intuitively that SVMs generalize better than the perceptron algorithm. How can we make this precise? Coming soon, but first...

Roadmap

Where are we headed?
- Other simple hypothesis spaces for supervised learning: nearest neighbor, decision trees
- Learning theory: generalization and PAC bounds, VC dimension, bias-variance tradeoff
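The hinge-loss formulation can be minimized with (sub)gradient descent, as the slides note. Below is a minimal sketch in plain Python; the toy data set, learning rate, and value of $c$ are invented for illustration, with one negative point deliberately placed inside the positive cluster so the data is not linearly separable and slack is actually used:

```python
# Toy 2-D data (illustrative): three positives in the upper right, two
# negatives in the lower left, and one negative overlapping the positives.
data = [((2.0, 2.0), +1), ((2.5, 1.5), +1), ((3.0, 2.5), +1),
        ((-2.0, -2.0), -1), ((-1.5, -2.5), -1), ((2.2, 1.8), -1)]

def train_hinge_svm(data, c=1.0, lr=0.01, epochs=2000):
    """Minimize  c * sum_i max(0, 1 - y_i (w.x_i + b)) + 0.5 ||w||^2
    by batch (sub)gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        gw = [w[0], w[1]]  # gradient of the 0.5 ||w||^2 regularizer
        gb = 0.0
        for (x, y) in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # subgradient of the hinge term is -c * y * x
                gw[0] -= c * y * x[0]
                gw[1] -= c * y * x[1]
                gb -= c * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

w, b = train_hinge_svm(data)
preds = [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1 for (x, _) in data]
```

Because the objective is convex, plain subgradient descent is enough here; on this data the learned linear classifier should get every point right except the overlapping negative, which is exactly the kind of point the slack variables absorb.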

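The optimal-slack observation — for fixed $w$ and $b$, the smallest feasible $\xi_i$ is $\max(0,\ 1 - y_i(w^\top x_i + b))$ — can be checked directly. The weights and test points below are made up for illustration:

```python
def optimal_slack(w, b, x, y):
    # For fixed (w, b), the constraints require xi >= 1 - y (w.x + b) and
    # xi >= 0, and the objective pushes xi down, so the optimum is
    #   xi = max(0, 1 - y (w.x + b)).
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

w, b = [1.0, 0.0], 0.0
print(optimal_slack(w, b, [2.0, 0.0], +1))  # margin 2 >= 1   -> slack 0.0
print(optimal_slack(w, b, [0.5, 0.0], +1))  # inside margin   -> slack 0.5
print(optimal_slack(w, b, [1.0, 0.0], -1))  # misclassified   -> slack 2.0
```

The three cases match the slides: zero slack outside the margin, slack between 0 and 1 for points inside the margin, and slack greater than 1 for misclassified points.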
View Full Document
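The kernel-trick identity $k(x, x') = \phi(x)^\top \phi(x')$ can also be verified numerically on a small case. The quadratic kernel and its explicit degree-2 feature map below are one standard pairing; the specific test points are arbitrary:

```python
import math

def phi(x):
    # Explicit feature map for 2-D inputs whose inner product equals
    # the quadratic kernel: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])

def kernel(x, z):
    # Quadratic kernel k(x, z) = (x . z)^2, computed in the original
    # 2-D space -- no 3-D feature vectors ever materialized.
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # phi(x) . phi(z)
rhs = kernel(x, z)                                # (x . z)^2 = 1.0
```

This is why the dual only needing inner products matters: the kernel evaluates the inner product in the larger feature space at the cost of one inner product in the original space.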