Machine Learning - 10-601
Model Selection and Naïve Bayes
Geoff Gordon, Miroslav Dudík
(partly based on slides of Tom Mitchell)
http://www.cs.cmu.edu/~ggordon/10601/
September 23, 2009

Announcements
September 21, 2009: Netflix awards a $1 million prize to a team of
statisticians, machine-learning experts, and computer engineers.
"You're getting Ph.D.'s for a dollar an hour," Reed Hastings, chief of
Netflix, said of the people competing for the prize.

How to win $1 million
Goal: (user, movie) -> rating
Data: 100M (user, movie, date, rating) tuples
Performance measure: root mean squared error on a withheld test set

How to win $1 million
Part of the winning model is the "baseline model", which captures the
bulk of the information. [Koren 2009]

How to win $1 million
The data are split three ways: training set | quiz set | test set.
FAQ: why the quiz/test split?
"We wanted a way of informing you ... about your progress ... while
making it difficult for you to simply train and optimize against 'the
answer oracle'."

FAQ: why the quiz/test split?
Two goals for withholding data:
• model selection
• model assessment

What if data is scarce?
training set | validation set | test set

Cross-validation
• split the data randomly into K equal parts
• for each model setting: evaluate average performance across the
  K train-test splits
• train the best model on the full data set
For K = 3:
  split 1: [ Part 1 | Part 2 | evaluate error ]
  split 2: [ Part 1 | evaluate error | Part 3 ]
  split 3: [ evaluate error | Part 2 | Part 3 ]

The best model...
y ≈ w0 + w1x + w2x^2 + w3x^3 + w4x^4 + ... + w10x^10
...depends on the size of the data set.
K-fold cross-validation trains on (K-1)/K of the training data.

Controlling model complexity
• limit the number of features
• add a "complexity penalty"

Regularized estimation
min_w  error_train(w) + regularization(w)
min_w  -log p(data|w) - log p(w)

Examples of regularization (training error + regularization)
L2:  min_w  error_train(w) + λ ||w||_2^2
L1:  min_w  error_train(w) + λ ||w||_1

L1 vs L2
L1:
• sparse solutions
• more suitable when the number of features is much larger than
  the size of the training set
L2:
• computationally better-behaved
How do you choose λ?

Announcements
HW #3 out, due October 7

Classification
Goal: learn a map h: X -> Y
Data: (x1, y1), (x2, y2), ..., (xN, yN)
Performance measure: probability of misclassification

All you need to know is p(X,Y)...
If you knew p(X,Y), how would you classify an example x?
Why?

How many parameters need to be estimated?
Y binary; X described by M binary features X1, X2, ..., XM.
Data: p(X,Y) described by 2·2^M - 1 numbers.

Naïve Bayes assumption
• features of X are conditionally independent given the class Y

Example: Live in Sq Hill?
• S = 1 iff live in Sq Hill
• G = 1 iff shop in Sq Hill Giant Eagle
• D = 1 iff drive to CMU
• A = 1 iff owns a Mac

Naïve Bayes assumption
• usually incorrect...
• yet Naïve Bayes often performs well, even when the assumption is
  violated [see Domingos-Pazzani 1996]

Learning to classify text documents
• which emails are spam?
• which emails promise an attachment?
• which web pages are student home pages?
What are the features of X?
Feature Xj is the jth word.
Assumption #1: Naïve Bayes
Assumption #2: "bag of words" (word order does not matter)

What you should know about Naïve Bayes
Naïve Bayes:
• the assumption
• why we use it
Text classification:
• the bag-of-words model
Gaussian Naïve Bayes:
• each feature is a Gaussian given the class
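The K-fold cross-validation loop described on the slides can be sketched as below. The helper name `train_and_eval` is a hypothetical placeholder for whatever routine trains a model on one subset and returns its error on another; it is not part of the slides.

```python
import random

def k_fold_cv(data, k, train_and_eval):
    """Split `data` randomly into k roughly equal parts; for each part,
    train on the other k-1 parts and evaluate on the held-out part.
    Returns the average held-out error across the k splits."""
    data = list(data)
    random.shuffle(data)                    # random split, as on the slide
    folds = [data[i::k] for i in range(k)]  # k roughly equal parts
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_eval(train, held_out))
    return sum(errors) / k
```

To choose among model settings (e.g. polynomial degree or λ), one would run `k_fold_cv` once per setting, keep the setting with the lowest average error, and then retrain that model on the full data set, as the slide prescribes.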
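The L2-penalized objective from the regularization slides can be illustrated in the simplest possible setting: least squares with a single feature, where minimizing sum_i (y_i - w·x_i)^2 + λw^2 has a one-line closed form. This is an illustrative sketch, not code from the course.

```python
def ridge_1d(xs, ys, lam):
    """One-feature least squares with an L2 penalty:
    minimize sum_i (y_i - w * x_i)^2 + lam * w^2.
    Setting the derivative w.r.t. w to zero gives
    w = (sum x_i y_i) / (sum x_i^2 + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```

Note the behavior the slides describe: with λ = 0 we recover ordinary least squares, and as λ grows the weight shrinks toward zero. The value of λ itself is exactly the kind of setting one picks by cross-validation.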
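The parameter-counting argument for binary Y and M binary features can be made concrete with two small helper functions (a sketch written for this note, not from the slides): the full joint p(X,Y) has 2·2^M outcomes and one sum-to-one constraint, while under the Naïve Bayes assumption each feature only needs one number per class.

```python
def n_params_full(m):
    """Full joint p(X1..Xm, Y) for binary Y and m binary features:
    2 * 2^m outcomes, minus 1 because the probabilities sum to 1."""
    return 2 * 2**m - 1

def n_params_naive_bayes(m):
    """Naive Bayes: p(Y=1) needs 1 number; each p(Xj=1 | Y=y) needs
    1 number per class value, i.e. 2 numbers per feature."""
    return 1 + 2 * m
```

For M = 30 features the full joint needs over two billion parameters, while Naïve Bayes needs 61, which is why the factored model is estimable from realistic amounts of data.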
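The bag-of-words Naïve Bayes text classifier from the later slides can be sketched end to end. Add-one (Laplace) smoothing is included so unseen words do not produce zero probabilities; the smoothing choice and function names are this sketch's, not the slides'.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate p(Y) and per-class word counts from (document, label)
    pairs, where each document is a list of words (bag of words:
    position is ignored, only counts matter)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    vocab = {w for doc in docs for w in doc}
    return prior, counts, vocab

def predict_nb(model, doc):
    """Return argmax_y  log p(y) + sum_j log p(x_j | y),
    the Naive Bayes decision rule, with add-one smoothing."""
    prior, counts, vocab = model
    def score(c):
        total = sum(counts[c].values())
        s = math.log(prior[c])
        for w in doc:
            s += math.log((counts[c][w] + 1) / (total + len(vocab)))
        return s
    return max(prior, key=score)
```

Working in log probabilities avoids numeric underflow when documents are long, and the per-class sums make the conditional-independence assumption explicit: each word contributes one term, independently of the others, given the class.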