##
This **preview** shows page *1-2-3-4-25-26-27-51-52-53-54*
out of 54 **pages**.

*View Full Document*

End of preview. Want to read all 54 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document**Unformatted text preview:**

1CS 559: Machine LearningCS 559: Machine Learning Fundamentals and Applications5thSet of NotesInstructor: Philippos MordohaiWebpage: www.cs.stevens.edu/~mordohaiEmail:Philippos [email protected] eduE-mail: [email protected]: Lieb 215Homework IIHomework II•Due next Thursday (Oct 7) at 6:15Due next Thursday (Oct. 7) at 6:152ErrataErrata3•Casei=arbitraryWeek 2: slide 96Case i arbitrary– The covariance matrices are different for each category:)( 0itiitiwherewxwxWxxg 21 W1i i)(lnln11 w w101iiiiitiiiiP(Hyperquadrics which are: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, )(22w0iiiiiiyp p,yp p,yp p,hyperparaboloids, hyperhyperboloids)4Pattern Classification, Chapter 2% testingcorrect=0; Week 4: slide 39wrong=0;for i=1:length(test_data)lklhood1=exp(-(test data(i active feat)-mean1)^2/(2*var1))lklhood1 exp((test_data(i,active_feat)mean1) 2/(2 var1)) /sqrt(var1);lklhood2 = exp(-(test_data(i,active_feat)-mean2)^2/(2*var2));/sqrt(var2);post1 = lklhood1*prior1;post2 = lklhood2*prior2;if(post1 > post2 && test_data(i,9) == 0)correct = correct+1;elseif(post1 < post2 && test data(i,9) == 1)pp_,correct = correct+1;elsewrong = wrong+1;endend5Week 4: slide 71• Need to compute EZ[zi(k)] in the above expression• We are finally done with the E stepfi l tti jt dt tE[(k)]N d–for implementation, just need to compute EZ[zi(k)]. No need to compute Q6Project: Data SetsProject: Data Sets• General – UCI ML repository: http://archive.ics.uci.edu/ml/– DELVE: http://www.cs.toronto.edu/~delve/Netflix Challenge:http://www cs uic edu/~liub/NetflixKDDCup–Netflix Challenge: http://www.cs.uic.edu/ liub/Netflix-KDD-Cup-2007.html• Text– Enron email dataset: http://www.cs.cmu.edu/~enron/– Web page classification: http://www-2.cs.cmu.edu/~webkb/•Optical Character RecognitionOptical Character Recognition– Stanford dataset: http://ai.stanford.edu/~btaskar/ocr/– NIST dataset: http://yann.lecun.com/exdb/mnist/7Project: Data SetsProject: Data Sets• Images:C l h 101h // i i l h d /I D /C l h101/–Caltech 101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/– Caltech 256: http://www.vision.caltech.edu/Image_Datasets/Caltech256/– MIT Labelme http://labelme.csail.mit.edu/–PASCAL Visual Object Classes: SC sua Object C asseshttp://pascallin.ecs.soton.ac.uk/challenges/VOC/– Oxford buildings: http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/index.html–ETH Computer Vision datasets:http://www.vision.ee.ethz.ch/datasets/ETH Computer Vision datasets: http://www.vision.ee.ethz.ch/datasets/• Face Images– Yale face database: http://cvc.yale.edu/projects/yalefaces/yalefaces.html– Labeled Faces in the Wild: http://vis-www.cs.umass.edu/lfw/Bi IDh // bi id / /d l d / f /bi idf–BioID: http://www.bioid.com/support/downloads/software/bioid-face-database.html8OverviewOverview•DHS Ch. 3:DHS Ch. 3: – Expectation Maximization (EM)• Covered in DHS, but will be presented differently here• Note on Overfitting• DHS Ch. 4 (Sec. 4.1-4.5)– Non-parametric Classification• Density estimation•Parzenwindows•Parzenwindows• K–Nearest Neighbor estimation9OverfittingOverfitting•Prediction error: probability of test patternPrediction error: probability of test pattern not in class with max posterior (true)•Training error: probability of test pattern•Training error: probability of test pattern not in class with max posterior (estimated)Cl ifi i i dii•Classifier optimized w.r.t. training error– Training error: optimistically biased estimate fdiiof prediction errorNOT in DHS 10OverfittingOverfittingOverfitting:a learning algorithmoverfitstheOverfitting:a learning algorithm overfitsthe training data if it outputs a solution w when exists another solutionw’such that:exists another solution w such that:()(’)errortrain(w) < errortrain(w’) ANDerrortrue(w’) < errortrue(w) NOT in DHS 11Fish Classifier from DHS Ch 1Fish Classifier from DHS Ch. 1Pattern Classification, Chapter 1 12Minimum Training ErrorMinimum Training ErrorPattern Classification, Chapter 1 13Final Decision BoundaryFinal Decision BoundaryPattern Classification, Chapter 1 14Training/Test SplitTraining/Test Split •Randomly split dataset into two parts:Randomly split dataset into two parts:– Training dataTest data–Test data• Use training data to optimize parameters • Evaluate error using test dataNOT in DHS 15Training/Test SplitTraining/Test Split •How many points in each set?How many points in each set?• Very hard questionT f it i t ii tl d–Too few points in training set, learned classifier is badToo few points in test set classifier evaluation–Too few points in test set, classifier evaluation is insufficientLeaveoneout crossvalidation•Leave-one-out cross-validation• BootstrappingNOT in DHS 16Non-parametric ClassificationNon-parametric Classification17The HistogramThe Histogram• The simplest form of non-parametric density estimation is the histogram–Divide sample space in number of bins – Approximate the density at the center of each bin by the fraction of points that fall into the bin–Two parameters: bin width and starting position of first bin (or other pgp(equivalent pairs)• Drawbacks:– Depends on position of bin centersce e s• Often compute two histograms, offset by ½ bin width– Discontinuities as an artifact of bin boundariesof bin boundaries– Curse of dimensionality18IntroductionIntroduction•All parametric densities are unimodal(have a single local p(gmaximum), whereas many practical problems involve multi-modal densities• Non-parametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are knownthe underlying densities are known• There are two types of non-parametric methods:Eti t P( |)–Estimate P(x | j ) – Bypass density function and go directly to posterior probability estimationPattern Classification, Chapter 4 19Density Estimation– Probability that a vector x will fall in region R is: (1) ')'( dxxpP– P is a smoothed (or averaged) version of the density function p(x) if we have a sample of size n; therefore, the probability that k points fall in R is then:the probability that k points fall in R is then:(2) )1( knkkPPknPand the expected value for kis:E(k) =nP(3)E(k) = nP(3)Pattern Classification, Chapter

View Full Document