CS 559: Machine Learning Fundamentals and Applications
7th Set of Notes
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Project (35% of grade)
• Topics: pick one BEFORE March 17
• Has to be approved by March 17
• Present for 10 minutes on March 26
• CS Department poster session on April 28 (12-2pm)
  – 10% bonus on project grade
• Present work on April 30
• Submit brief report by May 4 (midnight)

Project Proposal
• Project title
• Data set
• Project idea: what is the objective, and what method(s) will be tested?
• Relevant papers
• Software you plan to write
• Experiments you plan to do
• Suggested topics: see last week's slides or ask me

Midterm: March 19 (20% of grade)
• Open book, open notes, open homeworks and solutions
• Calculators allowed
  – Need exp, log and sqrt
• NO laptops, no cellphones
  – Print out lecture notes, or ask me for extra copies

Midterm: Format
• Should take about 1 hour and 30 minutes
• True/False, multiple choice and/or short questions on theory
• Probably 3 problems
  – Not necessarily similar to homework problems
• Office hours during spring break: March 11, 4:30-6:00 and by appointment

Midterm: Material
• Week 1: not included, but you should know the basics of probability theory
• Week 2: Bayes Rule, Bayesian Decision Theory
  – Skip only hyper-quadrics (#55-56)
• Week 3: Notes on Week 2, Maximum Likelihood and Bayesian Estimation
• Week 4: Expectation Maximization and Hidden Markov Models
  – Skip Examples from Computer Vision (#34-78)
  – Skip decoding and learning in HMMs (#97-98)

Midterm: Material (continued)
• Week 5: PCA
  – Skip example of HMMs in practice (#3-15)
  – You will not be required to compute eigenvectors or eigenvalues
• Week 6: Eigenfaces, Fisher Linear Discriminant and Non-parametric Techniques
• Week 7: up to what we end up covering today

Overview
• Linear Discriminant Functions (DHS, Chap. 5 and notes based on Olga Veksler's slides)
  – Two classes
  – Multiple classes
  – Optimization with gradient descent
  – Perceptron Criterion Function
    • Batch perceptron rule
    • Single sample perceptron rule

Linear Discriminant Functions: Basic Idea
• Have samples x1, x2, …, xn from 2 classes
• Assume the 2 classes can be separated by a linear boundary l(θ) with some unknown parameters θ
• Fit the "best" boundary to the data by optimizing over the parameters θ
• What is "best"?
  – Minimize classification error on the training data?
  – Does not guarantee small testing error
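Not part of the original slides: a minimal NumPy sketch of the "minimize classification error on the training data" idea for a fixed two-class linear boundary. The helper name training_error and the hand-picked weights are assumptions made purely for illustration.

import numpy as np

def training_error(X, labels, w, w0):
    # Fraction of samples misclassified by the linear rule g(x) = w.x + w0.
    # Samples with g(x) > 0 are assigned to class +1, otherwise to class -1.
    g = X @ w + w0                        # g(x) for every sample
    predictions = np.where(g > 0, 1, -1)
    return np.mean(predictions != labels)

# Toy 2-D data: two clusters, one per class
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
labels = np.array([1, 1, -1, -1])
print(training_error(X, labels, w=np.array([1.0, 1.0]), w0=0.0))   # 0.0

In practice the weights would be found by optimizing a criterion over the parameters, which is exactly the topic of the slides that follow; as the slide notes, a small training error by itself does not guarantee a small testing error.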
Parametric Methods vs. Discriminant Functions
• Parametric methods:
  – Assume the shape of the density of each class is known: p1(x|θ1), p2(x|θ2), …
  – Estimate θ1, θ2, … from data
  – Use a Bayesian classifier to find the decision regions
• Discriminant functions:
  – Assume the discriminant functions have a known shape l(θ1), l(θ2), with parameters θ1, θ2, …
  – Estimate θ1, θ2, … from data
  – Use the discriminant functions directly for classification

Parametric Methods vs. Discriminant Functions
• In theory, the Bayesian classifier minimizes the risk
  – In practice, we do not have confidence in the assumed model shapes
  – In practice, we do not really need the actual density functions
• Estimating accurate density functions is much harder than estimating accurate discriminant functions
  – Some argue that estimating densities should be skipped
  – Why solve a harder problem than needed?

LDF: Introduction
• Discriminant functions can be more general than linear
• For now, focus on linear discriminant functions
  – Simple model (should try simpler models first)
  – Analytically tractable
• Linear discriminant functions are optimal for Gaussian distributions with equal covariance
• They may not be optimal for other data distributions, but they are very simple to use
• Knowledge of the class densities is not required when using linear discriminant functions
  – We can call it a non-parametric approach

LDF: Two Classes
• A discriminant function is linear if it can be written as
  g(x) = w^t x + w0
• w is called the weight vector and w0 the bias or threshold

LDF: Two Classes
• The decision boundary g(x) = w^t x + w0 = 0 is a hyperplane
  – The set of vectors x which, for some scalars a0, …, ad, satisfy a0 + a1 x(1) + … + ad x(d) = 0
  – A hyperplane is:
    – a point in 1D
    – a line in 2D
    – a plane in 3D

LDF: Two Classes
• g(x) = w^t x + w0
• w determines the orientation of the decision hyperplane
• w0 determines the location of the decision surface

LDF: Multiple Classes
• Suppose we have m classes
• Define m linear discriminant functions
  gi(x) = wi^t x + wi0
• Given x, assign it to class ci if
  gi(x) > gj(x) for all j ≠ i
• Such a classifier is called a linear machine
• A linear machine divides the feature space into m decision regions, with gi(x) being the largest discriminant if x is in region Ri

LDF: Multiple Classes
• For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:
  gi(x) = gj(x), i.e. (wi - wj)^t x + (wi0 - wj0) = 0
• Thus wi - wj is normal to Hij
• The distance from x to Hij is given by:
  (gi(x) - gj(x)) / ||wi - wj||

LDF: Multiple Classes
• Decision regions for a linear machine are convex
• In particular, decision regions must be spatially contiguous

LDF: Multiple Classes
• Thus the applicability of the linear machine is mostly limited to unimodal conditional densities p(x|θ)
  – Even though we did not assume any parametric models
• Example: a class that requires non-contiguous decision regions
  – The linear machine will fail

Augmented Feature Vector
• Linear discriminant function: g(x) = w^t x + w0
• Can rewrite it as g(x) = a^t y, with a = [w0, w1, …, wd]^t and y = [1, x(1), …, x(d)]^t
• y is called the augmented feature vector
• We added a dummy dimension to get a completely equivalent new homogeneous problem

Augmented Feature Vector
• Feature augmentation is done for simpler notation
• From now on, always assume that we have augmented feature vectors
  – Given samples x1, …, xn, convert them to augmented samples y1, …, yn by adding a new dimension of value 1
  – (see the short code sketch below, which combines augmentation with a linear machine)

Training Error
• For the rest of the lecture, assume we have 2 classes
  – Samples: y1, …, yn, some in class
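Not part of the original slides: a minimal NumPy sketch of feature augmentation and a linear machine. The helper names (augment, linear_machine_predict) and the hand-picked weight matrix A are assumptions made for illustration only; in practice the weight vectors would be learned, for example with the perceptron rules listed in the Overview.

import numpy as np

def augment(X):
    # Prepend a dummy feature of value 1 to every sample, so that
    # g(x) = w.x + w0 becomes the homogeneous form g(y) = a.y.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def linear_machine_predict(Y, A):
    # Linear machine: assign each augmented sample y to the class whose
    # discriminant gi(y) = ai.y is largest. A holds one augmented weight
    # vector ai per row.
    scores = Y @ A.T            # n_samples x m_classes matrix of gi(y)
    return np.argmax(scores, axis=1)

# Toy example with 3 classes in 2D (weights chosen by hand for illustration)
X = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]])
A = np.array([[0.0,  1.0,  0.0],    # class 0: g0(y) = x1
              [0.0,  0.0,  1.0],    # class 1: g1(y) = x2
              [0.0, -1.0, -1.0]])   # class 2: g2(y) = -x1 - x2
print(linear_machine_predict(augment(X), A))   # [0 1 2]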

