Support Vector Machines
Nicholas Ruozzi
University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate

Support Vector Machines
• Given linearly separable data, many different hyperplanes classify every training point correctly. How can we decide between perfect classifiers?
• Define the margin to be the distance of the closest data point to the classifier.
• Support vector machines (SVMs) choose the classifier with the largest margin.
• This choice has good practical and theoretical performance.

Some Geometry
• In $n$ dimensions, a hyperplane is a solution to the equation $w^T x + b = 0$ with $w \in \mathbb{R}^n$, $b \in \mathbb{R}$.
• The vector $w$ is sometimes called the normal vector of the hyperplane.
• Note that this equation is scale invariant: for any nonzero scalar $c$, $c \cdot (w^T x + b) = 0$ describes the same hyperplane.
• The distance between a point $y$ and the hyperplane $w^T x + b = 0$ is the length of the segment perpendicular to the hyperplane that joins $y$ to the hyperplane. If $z$ is the point where this segment meets the hyperplane, the vector from $z$ to $y$ is
  $y - z = \|y - z\| \cdot \frac{w}{\|w\|}$

Scale Invariance
• Let $w^T x + b = c$ and $w^T x + b = -c$ be the parallel hyperplanes through the closest data points on either side. By scale invariance, we can assume that $c = 1$.
• The maximum margin is always attained by choosing $w^T x + b = 0$ so that it is equidistant from the closest data point classified as $+1$ and the closest data point classified as $-1$.
• We want to maximize the margin subject to the constraints $y^{(i)}(w^T x^{(i)} + b) \geq 1$ for all $i$.
• But how do we compute the size of the margin?
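The distance formula can be checked numerically. A minimal sketch in NumPy (the normal vector, offset, and test point below are hypothetical values, not from the slides):

```python
import numpy as np

# Distance from a point y to the hyperplane w^T x + b = 0:
# if z is the orthogonal projection of y onto the hyperplane,
# then ||y - z|| = |w^T y + b| / ||w||.
def point_to_hyperplane_distance(w, b, y):
    return abs(w @ y + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # normal vector (hypothetical)
b = -5.0
y = np.array([5.0, 5.0])   # test point (hypothetical)

d = point_to_hyperplane_distance(w, b, y)   # |15 + 20 - 5| / 5 = 6.0

# Scale invariance: rescaling (w, b) by any nonzero c describes the
# same hyperplane, so the distance is unchanged.
d_scaled = point_to_hyperplane_distance(10 * w, 10 * b, y)
```

The scaled call illustrates why the slides are free to normalize the closest points to $w^T x + b = \pm 1$: multiplying $w$ and $b$ by a constant changes neither the hyperplane nor any distance to it.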
Some Geometry
• Putting it all together: with $z$ the projection of $y$ onto the hyperplane,
  $y - z = \|y - z\| \cdot \frac{w}{\|w\|}$, $\quad w^T y + b = 1$, $\quad w^T z + b = 0$
• Subtracting the two hyperplane equations gives $w^T(y - z) = 1$, while the first equation gives $w^T(y - z) = \|y - z\| \cdot \|w\|$, so
  $\|y - z\| = 1 / \|w\|$

SVMs
• This analysis yields the following optimization problem:
  $\max_{w, b} \frac{1}{\|w\|}$ such that $y^{(i)}(w^T x^{(i)} + b) \geq 1$, for all $i$
• Or, equivalently,
  $\min_{w, b} \|w\|^2$ such that $y^{(i)}(w^T x^{(i)} + b) \geq 1$, for all $i$
• This is a standard quadratic programming problem: it falls into the class of convex optimization problems and can be solved with many specialized optimization tools (e.g., quadprog() in MATLAB).
• Where does the name come from? The data points such that $y^{(i)}(w^T x^{(i)} + b) = 1$ are called support vectors. The SVM classifier is completely determined by the support vectors (you can delete the rest of the data and get the same answer).
• What if the data isn't linearly separable?
  • Use feature vectors.
  • Relax the constraints (coming soon).
• What if we want to do more than just binary classification (i.e., if $y \in \{1, 2, 3\}$)?

Multiclass Classification
(Figure: a data set with three labels, 1, 2, and 3.)

One-Versus-All SVMs
• Compute a classifier for each label versus the remaining labels (i.e., an SVM with the selected label as $+1$ and the remaining labels changed to $-1$).
(Figure: the regions correctly classified by exactly one of the three classifiers.)
• Let $f_k(x) = w^{(k)T} x + b^{(k)}$ be the classifier for the $k$th label.
• For a new data point $x$, classify it as $k' \in \arg\max_k f_k(x)$.
• Drawback: if there are $L$ possible labels, this requires learning $L$ classifiers over the entire data set.
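The quadratic program can be handed to a generic constrained solver. The slides mention quadprog() in MATLAB; a minimal sketch of the same problem in Python with SciPy, on a hypothetical linearly separable toy data set:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two positive and two negative points in 2D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Decision variables theta = (w, b).
# Minimize ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i.
objective = lambda t: t[:d] @ t[:d]
constraints = [{"type": "ineq",
                "fun": lambda t, i=i: y[i] * (t[:d] @ X[i] + t[d]) - 1.0}
               for i in range(n)]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints)
w, b = res.x[:d], res.x[d]

margins = y * (X @ w + b)   # every margin should come out >= 1;
# the points whose margin is exactly 1 are the support vectors.
```

For this toy data the support vectors are $(2, 2)$ and $(0, 0)$; dropping the other two points and re-solving gives the same hyperplane, which is the "completely determined by the support vectors" property from the slides.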
One-Versus-All SVMs
(Figure: the regions in which points are classified by the highest value of $w^{(k)T} x + b^{(k)}$.)

One-Versus-One SVMs
• An alternative strategy is to construct a classifier for every possible pair of labels.
• Given a new data point, classify it by majority vote (i.e., find the most common label among the outputs of all of the pairwise classifiers).
• If there are $L$ labels, this requires computing $\binom{L}{2}$ different classifiers, each of which uses only a fraction of the data.
• Drawbacks: this can overfit if some pairs of labels do not have a significant amount of data (plus it can be computationally expensive).
(Figure: the regions determined by majority vote over the pairwise classifiers.)
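The one-versus-all scheme can be sketched end to end. This is a minimal illustration, not the slides' implementation: it trains one hard-margin linear SVM per label with SciPy on hypothetical, well-separated toy clusters, then classifies by the largest $f_k(x)$:

```python
import numpy as np
from scipy.optimize import minimize

def train_binary_svm(X, y):
    """Hard-margin SVM: min ||w||^2 s.t. y_i (w^T x_i + b) >= 1, y_i in {-1, +1}."""
    n, d = X.shape
    cons = [{"type": "ineq",
             "fun": lambda t, i=i: y[i] * (t[:d] @ X[i] + t[d]) - 1.0}
            for i in range(n)]
    res = minimize(lambda t: t[:d] @ t[:d], np.zeros(d + 1), constraints=cons)
    return res.x[:d], res.x[d]

# Hypothetical toy data: three well-separated clusters with labels 1, 2, 3.
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0],
              [2.5, 5.0], [2.5, 6.0]])
labels = np.array([1, 1, 2, 2, 3, 3])

# One classifier per label: that label relabeled +1, all others -1.
classifiers = {k: train_binary_svm(X, np.where(labels == k, 1.0, -1.0))
               for k in np.unique(labels)}

def predict(x):
    # Classify by the largest value of f_k(x) = w_k^T x + b_k.
    return max(classifiers, key=lambda k: classifiers[k][0] @ x + classifiers[k][1])
```

Note this trains $L$ classifiers, each over the entire data set, which is the drawback the slides point out; a one-versus-one version would instead call `train_binary_svm` once per pair of labels on just that pair's points and take a majority vote.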
