STEVENS CS 559 Lecture Notes - Recap of Weeks 9-13

CS 559: Machine Learning Fundamentals and Applications (Recap of Weeks 9-13)
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215
(Pattern Classification, Chapter 1)

Outline
• All topics covered in the 7th set of notes (midterm recap)
• Fisher linear discriminant
• Linear discriminant functions
• Support vector machines (SVM)
• Ensemble methods
• Unsupervised learning
• Other notes

Fisher Linear Discriminant
• Main idea: find the projection onto a line such that samples from different classes are well separated.
• We need to normalize by both the scatter of class 1 and the scatter of class 2. Writing the projected samples as y_i = v^t x_i, with projected means m̃_1, m̃_2 and projected scatters s̃_1^2, s̃_2^2, the Fisher linear discriminant is the projection onto the line in the direction v that maximizes
    J(v) = |m̃_1 - m̃_2|^2 / (s̃_1^2 + s̃_2^2)
• Define the within-class scatter matrix S_W = S_1 + S_2, where S_i = Σ_{x∈class i} (x - m_i)(x - m_i)^t; then s̃_1^2 + s̃_2^2 = v^t S_W v.
• Similarly, define the between-class scatter matrix S_B = (m_1 - m_2)(m_1 - m_2)^t. S_B measures the separation of the means of the two classes before projection; the separation of the projected means can be written as |m̃_1 - m̃_2|^2 = v^t S_B v.
• Thus our objective function can be written as
    J(v) = (v^t S_B v) / (v^t S_W v)
• After some manipulation, and after solving a generalized eigenvalue problem, the solution is v = S_W^{-1}(m_1 - m_2).
(Pattern Classification, Chapter 3)
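As a quick illustration of the closed-form solution above, here is a minimal sketch in plain Python on made-up 2-D data; the data and function names are hypothetical, not from the slides:

```python
# Fisher linear discriminant for two classes in 2-D: a minimal sketch
# on made-up data. For two classes, the generalized eigenvalue problem
# reduces to v proportional to S_W^{-1} (m_1 - m_2).

def mean(points):
    n = len(points)
    return [sum(x for x, _ in points) / n, sum(y for _, y in points) / n]

def scatter(points, m):
    # class scatter matrix: sum of outer products (x - m)(x - m)^t
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx
        s[0][1] += dx * dy
        s[1][0] += dy * dx
        s[1][1] += dy * dy
    return s

def fisher_direction(class1, class2):
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # S_W must be invertible
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

class1 = [(4.0, 2.0), (5.0, 3.0), (4.0, 3.0)]
class2 = [(1.0, 6.0), (2.0, 7.0), (1.0, 7.0)]
v = fisher_direction(class1, class2)
```

Projecting both classes onto v separates them: for this toy data every class-1 projection v^t x exceeds every class-2 projection.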
Linear Discriminant Functions: Basic Idea
• We have samples from 2 classes: x_1, x_2, ..., x_n.
• Assume the 2 classes can be separated by a linear boundary l(θ) with some unknown parameters θ.
• Fit the "best" boundary to the data by optimizing over the parameters θ.
• What is best? Minimize the classification error on the training data?
  – Does not guarantee a small testing error.
(Pattern Classification, Chapter 5)

Parametric Methods vs. Discriminant Functions
• Parametric methods: assume the shape of the density for each class is known, p_1(x|θ_1), p_2(x|θ_2), ...; estimate θ_1, θ_2, ... from the data; use a Bayesian classifier to find the decision regions.
• Discriminant functions: assume the discriminant functions are of known shape l(θ_1), l(θ_2), ..., with parameters θ_1, θ_2, ...; estimate θ_1, θ_2, ... from the data; use the discriminant functions for classification.

LDF: Two Classes
• A discriminant function is linear if it can be written as
    g(x) = w^t x + w_0
• w is called the weight vector and w_0 is called the bias or threshold.

LDF: Multiple Classes
• Suppose we have m classes.
• Define m linear discriminant functions
    g_i(x) = w_i^t x + w_i0
• Given x, assign it to class c_i if
    g_i(x) > g_j(x) for all j ≠ i
• Such a classifier is called a linear machine.
• A linear machine divides the feature space into m decision regions, with g_i(x) being the largest discriminant if x is in region R_i.
• For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by:
    g_i(x) = g_j(x), i.e. (w_i - w_j)^t x + (w_i0 - w_j0) = 0
• Thus w_i - w_j is normal to H_ij.
• The distance from x to H_ij is given by:
    (g_i(x) - g_j(x)) / ||w_i - w_j||

Augmented Feature Vector
• Linear discriminant function: g(x) = w^t x + w_0
• Can rewrite it as g(x) = a^t y, with a = [w_0, w^t]^t and y = [1, x^t]^t.
• y is called the augmented feature vector.
• We added a dummy dimension to get a completely equivalent new homogeneous problem.
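The linear machine described above is easy to sketch directly; the weights below are made up purely for illustration:

```python
# A linear machine over m = 3 classes in 2-D: a minimal sketch.
# Each class i has a discriminant g_i(x) = w_i . x + w_i0; these
# made-up weights carve the plane into three decision regions.
weights = [([1.0, 0.0], 0.0),    # (w_i, w_i0) for class 0
           ([0.0, 1.0], 0.0),    # class 1
           ([-1.0, -1.0], 1.0)]  # class 2

def classify(x):
    # assign x to the class whose discriminant value is largest
    scores = [w[0] * x[0] + w[1] * x[1] + w0 for w, w0 in weights]
    return max(range(len(scores)), key=lambda i: scores[i])
```

For example, classify([3.0, 0.5]) picks class 0 because g_0(x) = 3.0 dominates the other two discriminants.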
• Feature augmentation is done for simpler notation.
• From now on, always assume that we have augmented feature vectors:
  – Given samples x_1, ..., x_n, convert them to augmented samples y_1, ..., y_n by adding a new dimension of value 1.

"Normalization"
• Training error is 0 if:
    a^t y_i > 0 for every y_i in class 1, and a^t y_i < 0 for every y_i in class 2
• Equivalently, training error is 0 if:
    a^t y_i > 0 for every y_i, once each class-2 sample is replaced by its negation
• This suggests "normalization" (a.k.a. reflection):
  1. Replace all examples from class 2 by their negations: y_i ← -y_i.
  2. Seek a weight vector a such that a^t y_i > 0 for all samples.
• If such an a exists, it is called a separating or solution vector.
  – The original samples x_1, ..., x_n can then indeed be separated by a line.

Solution Region
• Solution region for a: the set of all possible solutions, defined in terms of the normal a to the separating hyperplane.

Perceptron Criterion Function
• If y is misclassified, a^t y < 0. Define
    J_p(a) = Σ_{y∈Y_M} (-a^t y)
  over the set Y_M of misclassified samples; thus J_p(a) > 0.
• J_p(a) is ||a|| times the sum of the distances of the misclassified examples to the decision boundary.
• J_p(a) is piecewise linear and thus suitable for gradient descent.

Perceptron Batch Rule
• The gradient of J_p(a) is:
    ∇J_p(a) = Σ_{y∈Y_M} (-y)
  – Y_M is the set of samples misclassified by a(k).
  – It is not possible to solve ∇J_p(a) = 0 analytically because of Y_M.
• The update rule for gradient descent is a(k+1) = a(k) - η(k) ∇J(a(k)); thus the gradient descent batch update rule for J_p(a) is:
    a(k+1) = a(k) + η(k) Σ_{y∈Y_M} y
• It is called the batch rule because it is based on all misclassified examples.

Perceptron Single Sample Rule
• The gradient descent single sample rule for J_p(a) is:
    a(k+1) = a(k) + η(k) y_M
  – Note that y_M is one sample misclassified by a(k).
  – Must have a consistent way of visiting the samples.
• Geometric interpretation:
  – y_M is misclassified by a(k): it lies on the wrong side of the decision hyperplane.
  – Adding η y_M to a moves the new decision hyperplane in the right direction with respect to y_M.
(Pattern Classification, Chapter 5)
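The single sample rule above can be sketched as a short training loop; the toy samples are made up, already augmented with a leading 1, and "normalized" so that a^t y > 0 means y is correctly classified:

```python
# Perceptron single-sample rule: a minimal sketch on made-up samples
# that are already augmented and "normalized" (a . y > 0 iff correct).

def dot(a, y):
    return sum(ai * yi for ai, yi in zip(a, y))

def perceptron(ys, eta=1.0, max_epochs=100):
    a = [0.0] * len(ys[0])
    for _ in range(max_epochs):
        errors = 0
        for y in ys:                 # fixed, consistent visiting order
            if dot(a, y) <= 0:       # y is misclassified by the current a
                a = [ai + eta * yi for ai, yi in zip(a, y)]  # a <- a + eta*y
                errors += 1
        if errors == 0:              # converged: every sample on the right side
            break
    return a

ys = [[1.0, 2.0, 1.0], [1.0, 1.5, 0.5],    # class 1, augmented
      [-1.0, 1.0, 2.0], [-1.0, 0.5, 1.5]]  # class 2, augmented and negated
a = perceptron(ys)
```

Because this toy data is linearly separable, the loop stops once every a^t y is positive; the batch rule would instead add η times the sum over all currently misclassified samples at each step.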