CORNELL CS 4700 - Study Notes

Perceptrons and Optimal Hyperplanes

Example: Majority-Vote Function
• Definition: majority-vote function fmajority
  – N binary attributes, i.e. x ∈ {0,1}^N
  – If more than N/2 attributes in x are true, then fmajority(x) = +1, else fmajority(x) = -1.
• How can we represent this function as a decision tree?
  – Only as a huge and awkward tree!
• Is there an "easier" representation of fmajority?
  – Yes: a single linear threshold, fmajority(x) = sign(Σi xi - N/2); see the Python sketch after the perceptron example below.

Example: Spam Filtering
• Instance space X:
  – Feature vector of word occurrences => binary features
  – N features (N typically > 50000)
• Target concept c:
  – Spam (+1) / Ham (-1)
• Type of function to learn:
  – Set of spam words S, set of ham words H
  – Classify as spam (+1) if the example contains more spam words than ham words.
[Figure: example emails scored on the words viagra, learning, the, dating, lottery]

Example: Spam Filtering (continued)
• Use weight vector w = (+1, -1, 0, +1, +1) over the words (viagra, learning, the, dating, lottery)
  – Compute sign(w·x)
• More generally, we can use real-valued weights to express the "spamminess" of each word, e.g. w = (+10, -1, -0.3, +1, +5).
• Quiz: Which vector is most likely to be spam with this weighting? (A = x1, B = x2, C = x3 refer to the example vectors in the figure.)

Linear Classification Rules
• Hypotheses of the form
  – unbiased: h(x) = sign(w·x)
  – biased: h(x) = sign(w·x + b)
  – parameter vector w ∈ R^N, scalar b ∈ R
• Hypothesis space H
  – unbiased: H = {h | h(x) = sign(w·x), w ∈ R^N}
  – biased: H = {h | h(x) = sign(w·x + b), w ∈ R^N, b ∈ R}
• Notation
  – w·x = Σi wi xi denotes the inner product
  – sign(z) = +1 if z > 0, else -1
  – hw(x) = w·x denotes the real-valued score the classifier assigns to x

Geometry of Hyperplane Classifiers
• Linear classifiers divide the instance space along a hyperplane w·x + b = 0.
• One side is classified positive, the other side negative.

Homogeneous Coordinates
• A biased classifier can be made unbiased by appending a constant feature:
  – biased: x = (x1, x2) with parameters w = (w1, w2) and bias b
  – homogeneous: x' = (x1, x2, 1) with w' = (w1, w2, w3), where w3 plays the role of b
  – then w'·x' = w·x + b, so the unbiased form loses no generality.

(Batch) Perceptron Algorithm
  Input: training examples (x1, y1), ..., (xn, yn); learning rate η
  Initialize w = 0.
  Repeat (one training epoch):
    For i = 1, ..., n:
      If yi (w·xi) ≤ 0:  w ← w + η yi xi
  Until no update was made during the epoch.

Example: Perceptron Training
Training data:
  x1 = (1, 2),   y1 = +1
  x2 = (3, 1),   y2 = +1
  x3 = (-1, -1), y3 = -1
  x4 = (-1, 1),  y4 = -1
Updates to the weight vector (init: w0 = 0, η = 1):
• w0·x1 = 0 → incorrect.
  w1 = w0 + η y1 x1 = 0 + 1·1·(1,2) = (1,2)
  The update raises the score of x1 by η y1 (x1·x1):
  hw1(x1) = hw0(x1) + 1·1·(x1·x1) = 0 + 5 = 5
• w1·x2 = (1,2)·(3,1) = 5 → correct (y2 = +1)
• w1·x3 = (1,2)·(-1,-1) = -3 → correct (y3 = -1)
• w1·x4 = (1,2)·(-1,1) = 1 → incorrect (y4 = -1).
  w2 = w1 + η y4 x4 = (1,2) - (-1,1) = (2,1)
  The update lowers the score of x4 by η (x4·x4):
  hw2(x4) = hw1(x4) + 1·(-1)·(x4·x4) = 1 - 2 = -1
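The epoch trace above is easy to verify in code. Below is a minimal sketch of the batch perceptron in Python/NumPy under the conventions used in the trace (η = 1, a score of exactly 0 counts as a mistake); the function name perceptron_train is illustrative, not from the notes.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Batch perceptron: sweep over the data in epochs and update on
    every mistake, i.e. whenever y_i * (w . x_i) <= 0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:    # zero score counts as a mistake
                w = w + eta * y_i * x_i      # perceptron update
                mistakes += 1
        if mistakes == 0:                    # converged: a clean epoch
            break
    return w

# The four training examples from the trace above.
X = np.array([[1, 2], [3, 1], [-1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

w = perceptron_train(X, y)
print(w)   # (2, 1), matching w2 in the trace; epoch 2 makes no mistakes

# For a biased classifier, append a constant 1 to every x (homogeneous
# coordinates); the last component of w then plays the role of b.
X_h = np.hstack([X, np.ones((len(X), 1))])
w_h = perceptron_train(X_h, y)   # w_h = (w1, w2, b)
```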
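Returning to the two opening examples: both the majority-vote function and the spam scorer are single linear classification rules. A short sketch below; the word counts for the spam example are made up for illustration, since the quiz vectors x1, x2, x3 appear only in a figure that the preview does not include.

```python
import numpy as np

def majority_vote(x):
    """f_majority(x): +1 iff more than N/2 attributes are true.
    Equivalently a linear threshold on sum(x) at N/2, so no giant
    decision tree is needed."""
    return 1 if x.sum() > len(x) / 2 else -1

print(majority_vote(np.array([1, 1, 1, 0, 0])))  # 3 of 5 true -> +1
print(majority_vote(np.array([1, 0, 0, 0, 0])))  # 1 of 5 true -> -1

# Real-valued "spamminess" weights over the words
# (viagra, learning, the, dating, lottery), as in the notes.
w = np.array([10.0, -1.0, -0.3, 1.0, 5.0])
x = np.array([1, 0, 2, 1, 0])   # hypothetical word counts for one email
print(np.sign(w @ x))           # w.x = 10.4 > 0 -> spam (+1)
```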
Example: Reuters Text Classification
[Figure: Reuters news-story classification example motivating the "optimal hyperplane"]

Optimal Hyperplanes
• Assumption: the training examples are linearly separable.

Hard-Margin Separation
• Goal: find the hyperplane with the largest distance δ to the closest training examples.
• Support vectors: the examples with minimal distance (i.e. margin δ) to the hyperplane.
• Optimization problem (primal):
    min_{w,b}  (1/2) w·w
    s.t.  yi (w·xi + b) ≥ 1   for all i = 1, ..., n
[Figure: separating hyperplane with margin δ marked to the closest examples on both sides]

Why min ½ w·w?
• Maximizing δ while constraining w is equivalent to constraining δ and minimizing w.
  – We want the maximum margin δ; we do not care about w itself.
  – But because the margin is measured through w·x + b, simply asking for maximum δ yields arbitrarily large w: scaling w and b scales the margin value too.
  – So we fix the scale by requiring min_i yi(w·xi + b) = 1, which makes the geometric margin δ = 1/||w||.
• Maximizing δ is then equivalent to minimizing ||w||, and minimizing ½ w·w is a convenient differentiable way to do it.

Non-Separable Training Data
• Limitations of the hard-margin formulation:
  – For some training data, there is no separating hyperplane.
  – Complete separation (i.e. zero training error) can lead to suboptimal prediction error.

Soft-Margin Separation
• Idea: maximize the margin and minimize the training error at the same time.
• Hard-margin OP (primal):
    min_{w,b}  (1/2) w·w
    s.t.  yi (w·xi + b) ≥ 1
• Soft-margin OP (primal):
    min_{w,b,ξ}  (1/2) w·w + C Σi ξi
    s.t.  yi (w·xi + b) ≥ 1 - ξi  and  ξi ≥ 0
• The slack variable ξi measures by how much (xi, yi) fails to achieve margin δ.
[Figure: slack variables ξi for examples that violate the margin]

Controlling Soft-Margin Separation
• Σ ξi is an upper bound on the number of training errors.
• C is a parameter that controls the trade-off between margin and training error; a concrete demo of varying C appears at the end of this section.
• Quiz: given two classifiers A and B trained on the same data [figure], which of the two was produced using the larger value of C?

Example: Reuters "acq": Varying C
[Figure: classification performance on the Reuters "acq" category as C is varied]

Example: Margin in High-Dimension
Training examples (7 binary features):

                x1  x2  x3  x4  x5  x6  x7 |  y
  Example 1      1   0   0   1   0   0   0 | +1
  Example 2      1   0   0   0   1   0   0 | +1
  Example 3      0   1   0   0   0   1   0 | -1
  Example 4      0   1   0   0   0   0   1 | -1

Candidate hyperplanes:

                 w1     w2     w3    w4    w5     w6     w7    |  b
  Hyperplane 1   1      1      0     0     0      0      0     |  2
  Hyperplane 2   0      0      0     1     1     -1     -1     |  0
  Hyperplane 3   1     -1      1     0     0      0      0     |  0
  Hyperplane 4   1     -1      0     0     0      0      0     |  0
  Hyperplane 5   0.95  -0.95   0     0.05  0.05  -0.05  -0.05  |  (cut off)
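The table invites checking which hyperplanes actually separate the four examples, and with what margin. A small sketch: the geometric margin of (w, b) is min_i yi(w·xi + b) / ||w||, which comes out negative if some example is misclassified. Hyperplane 5's bias is cut off in the preview, so b = 0 is assumed for it here.

```python
import numpy as np

# Training examples and labels from the table above.
X = np.array([
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1],
], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

# Candidate hyperplanes (w, b); b of hyperplane 5 is assumed to be 0
# because it is cut off in the preview.
hyperplanes = [
    (np.array([1.0, 1, 0, 0, 0, 0, 0]), 2.0),
    (np.array([0.0, 0, 0, 1, 1, -1, -1]), 0.0),
    (np.array([1.0, -1, 1, 0, 0, 0, 0]), 0.0),
    (np.array([1.0, -1, 0, 0, 0, 0, 0]), 0.0),
    (np.array([0.95, -0.95, 0, 0.05, 0.05, -0.05, -0.05]), 0.0),
]

for k, (w, b) in enumerate(hyperplanes, start=1):
    # Worst-case signed distance of any training example to the hyperplane.
    margin = np.min(y * (X @ w + b)) / np.linalg.norm(w)
    print(f"hyperplane {k}: geometric margin = {margin:+.3f}")

# E.g. hyperplane 4 reaches 1/sqrt(2) ~ +0.707 while hyperplane 2 only
# reaches +0.5; a negative value means some example is misclassified.
```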
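Finally, the effect of C can be demonstrated with any soft-margin SVM solver. Below is a minimal sketch using scikit-learn's SVC with a linear kernel; the solver choice and the toy data set are assumptions, not from the notes. As C grows, the optimizer typically tolerates less slack, so the margin shrinks and training errors drop.

```python
import numpy as np
from sklearn.svm import SVC

# A small 2-D training set (made up for illustration).
X = np.array([[1.0, 2.0], [3.0, 1.0], [2.5, 2.5],
              [-1.0, -1.0], [-1.0, 1.0], [0.5, 0.8]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    margin = 1.0 / np.linalg.norm(w)           # delta = 1 / ||w||
    errors = int(np.sum(clf.predict(X) != y))  # training errors at this C
    print(f"C={C:>6}: margin={margin:.3f}, errors={errors}, "
          f"support vectors={len(clf.support_vectors_)}")
```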

