
CS 559: Machine Learning Fundamentals and Applications
12th Set of Notes
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Online Course Evaluation
• Please visit https://www.stevens.edu/assess and evaluate the course

Project (35% of grade)
• CS Department Project Poster Day (April 28, 12-2)
– 5% of total grade bonus
• Presentation in class (April 30)
– 15% of total grade
• Final report due (May 4, midnight)
– 15% of total grade

CS Department Project Poster Day
• April 28, 12-2
• Lieb third floor conference room and corridors
• 5% of total grade bonus
– Not negligible
– Not unfair for those who cannot make it
• Suggestion: 9-12 printed pages

CS Department Project Poster Day
• Your name, course number, project title
• Project objective: what are you trying to accomplish?
• Method: which method(s) will be tested
– General description of the method (not necessarily for the problem at hand)
• Data set description
– Include the test/train split
• Pre- and post-processing specific to this problem
• Experiments
• Conclusions on methods and experimental results

Project Presentation
• April 30
• 15% of total grade
• Let me know in advance if you cannot make it
• Use the previous structure
– Besides comprehensive results, also show individual examples (correct and wrong)
• Plan for 20 minutes

Final Report
• Due May 4 (midnight)
• 8-10 pages including figures, tables and references
• Counts for 15% of total grade
• 33% late penalty per day

Instructions for Final
• Emphasis on new material, not covered in the Midterm
• Old material is still included
– Except EM
• Open book, open notes, open homeworks and solutions
• No laptops, no cellphones; calculators OK
– No graphical solutions.
– Show all computations

Final: Material
• Notes 1: not included, but you should know the basics of probability theory
• Notes 2: Bayes Rule, Bayesian Decision Theory
– Skip only hyper-quadrics (#55-56)
• Notes 3: Notes on Week 2, Maximum Likelihood and Bayesian Estimation
• Notes 4: Expectation Maximization and Hidden Markov Models
– Skip EM (#3-33)
– Skip Examples from Computer Vision (#34-78)
– Skip decoding and learning in HMMs (#97-98)

Final: Material
• Notes 5: PCA
– Skip the example of HMMs in practice (#3-15)
– You will not be required to compute eigenvectors or eigenvalues
• Notes 6: Eigenfaces, Fisher Linear Discriminant and Non-parametric Techniques
• Notes 7: Linear Discriminant Functions and the Perceptron

Final: Material
• Notes 8: MSE procedures for LDFs, multi-class LDFs
• Notes 9: Support Vector Machines
– No need to compute the objective function (as in #12 and thereafter)
• Notes 10: SVM additional material, cross-validation, Boosting
– Skip training error and margin analysis (#40-47)
– Check readability of printouts

Final: Material
• Notes 11: Graphical models, Bayesian Networks, Markov Random Fields
– Skip D-separation (#24-27)
– Skip conversion from directed to undirected graphs (#30-32)
– Skip inference and factor graphs (#35-52)
• Notes 12: unsupervised learning, k-means and hierarchical clustering
– Skip #57-end

Overview
• Unsupervised Learning (slides by Olga Veksler)
– Supervised vs. unsupervised learning
– Unsupervised learning
– Flat clustering (k-means)
– Hierarchical clustering (also see DHS Ch. 10)

Supervised vs. Unsupervised Learning
• Up to now we considered supervised learning scenarios, where we are given
1. samples x1,…, xn
2. class labels for all samples
– This is also called learning with a teacher, since the correct answer (the true class) is provided
• Today we consider unsupervised learning scenarios, where we are only given
1. samples x1,…, xn
– This is also called learning without a teacher, since the correct answer is not provided
– Do not split the data into training and test sets

Unsupervised Learning
• Data is not labeled
• Parametric approach
– Assume a parametric distribution of the data
– Estimate the parameters of this distribution
– Remember Expectation Maximization?
• Non-parametric approach
– Group the data into clusters; each cluster (hopefully) says something about the classes present in the data

Why Unsupervised Learning?
• Unsupervised learning is harder
– How do we know if the results are meaningful? No answer (labels) is available
• Let the expert look at the results (external evaluation)
• Define an objective function on the clustering (internal evaluation)
• We nevertheless need it because
1. Labeling large datasets is very costly (speech recognition, object detection in images)
• Sometimes we can label only a few examples by hand
2. We may have no idea what/how many classes there are (data mining)
3. We may want to use clustering to gain some insight into the structure of the data before designing a classifier
• Clustering as data description

Clustering
• Seek "natural" clusters in the data
• What is a good clustering?
– internal (within-cluster) distances should be small
– external (between-cluster) distances should be large
• Clustering is a way to discover new categories (classes)

What we need for Clustering
1. A proximity measure, either
– a similarity measure s(xi, xk): large if xi and xk are similar
– a dissimilarity (or distance) measure d(xi, xk): small if xi and xk are similar
2. A criterion function to evaluate a clustering
3. An algorithm to compute the clustering
– For example, by optimizing the criterion function

How Many Clusters?
• Possible approaches
1. Fix the number of clusters to k
2. Find the best clustering according to the criterion function (the number of clusters may vary)

Proximity Measures
• A good proximity measure is VERY application dependent
– Clusters should be invariant under the transformations "natural" to the problem
– For example, for object recognition we should have invariance to rotation
– For character recognition, no invariance to rotation

Distance Measures
• Euclidean distance
– translation invariant
• Manhattan (city block) distance
– approximation to the Euclidean distance, cheaper to compute
• Chebyshev distance
– approximation to the Euclidean distance, cheapest to compute

Feature Scaling
• Old problem: how to choose an appropriate relative scale for the features?
– [length (in meters or cm?), weight (in grams or kg?)]
• In supervised learning, we can normalize to zero mean and unit variance with no problems
• In clustering this is more problematic
• If the variance in the data is due to cluster presence, then normalizing the features is not a good idea
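The three distance measures above can be written directly in NumPy. A minimal sketch; the function names are illustrative, not from the notes:

```python
import numpy as np

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    # (translation invariant)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 (city block) norm: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # L-infinity norm: largest absolute coordinate difference
    return np.max(np.abs(x - y))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7.0
print(chebyshev(x, y))  # 4.0
```

Note the ordering d_Chebyshev ≤ d_Euclidean ≤ d_Manhattan, which is why the cheaper norms can serve as approximations to the Euclidean distance.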
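The zero-mean, unit-variance normalization discussed under feature scaling can be sketched as follows, assuming NumPy; `standardize` is an illustrative helper, not from the notes:

```python
import numpy as np

def standardize(X):
    # Rescale each column (feature) to zero mean and unit variance.
    # Routine in supervised learning, but in clustering it can shrink
    # exactly the feature whose large variance comes from cluster separation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy data: feature 0 separates two groups (values near 0 vs. near 10),
# feature 1 is noise on a much larger scale
X = np.array([[0.0, 100.0],
              [1.0, -50.0],
              [10.0, 80.0],
              [11.0, -60.0]])
Z = standardize(X)
```

After standardization both columns have the same spread, so the cluster-carrying feature no longer dominates the distance computation, which is the caveat the slide raises.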
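The overview lists k-means as the flat-clustering method for this set of notes. A minimal sketch of the standard Lloyd iteration (assignment step plus centroid update), assuming NumPy and a fixed k; the `kmeans` helper and the toy data are illustrative, not from the notes:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) data matrix; k: number of clusters, fixed in advance
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct samples at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each sample goes to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned samples
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the update no longer moves the centroids
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of four points each; k-means should recover them
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1]])
labels, centroids = kmeans(X, k=2)
```

This is the criterion-function view from the slides in action: each iteration decreases the sum of squared within-cluster distances, so the internal distances shrink while the external (between-cluster) ones stay large.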

