CS 188: Artificial Intelligence
Spring 2006

Lecture 10: Perceptrons
2/16/2006

Dan Klein – UC Berkeley
Many slides from either Stuart Russell or Andrew Moore

Announcements

- Office hours: Dan's W 3-5 office hours moved to F 3-5 (just this week)
- Project 2: out now
- Written 1 back (check in front)
- Fill in your final exam time surveys! (in front)

Today

- Naïve Bayes models
  - Smoothing
  - Real-world issues
- Perceptrons
  - Mistake-driven learning
  - Data separation, margins, and convergence

General Naïve Bayes

- This is an example of a naïve Bayes model:
  P(C, E1, ..., En) = P(C) ∏i P(Ei | C)
- Total number of parameters is linear in n!
[Figure: graphical model with class node C pointing to feature nodes E1, E2, ..., En]

Example: Spam Filtering

- Model: P(C, W1, ..., Wn) = P(C) ∏i P(Wi | C)
- Parameters:

  P(C):        ham : 0.66   spam : 0.33

  P(W | ham):  the : 0.016, to : 0.015, and : 0.012, ...
               free : 0.001, click : 0.001, ...
               morally : 0.001, nicely : 0.001, ...

  P(W | spam): the : 0.021, to : 0.013, and : 0.011, ...
               free : 0.005, click : 0.004, ...
               screens : 0.000, minute : 0.000, ...

Estimation: Laplace Smoothing

- Laplace's estimate:
  P_LAP(x) = (c(x) + 1) / (N + |X|)
- Pretend you saw every outcome once more than you actually did
  (e.g. from observations H, H, T: P_LAP(H) = 3/5, P_LAP(T) = 2/5)
- Can derive this as a maximum a posteriori estimate using Dirichlet priors (see cs281a)

Estimation: Laplace Smoothing

- Laplace's estimate (extended): pretend you saw every outcome k extra times
  P_LAP,k(x) = (c(x) + k) / (N + k|X|)
- What's Laplace with k = 0? (The maximum-likelihood estimate.)
- k is the strength of the prior
- Laplace for conditionals: smooth each condition independently:
  P_LAP,k(x | y) = (c(x, y) + k) / (c(y) + k|X|)

Estimation: Linear Interpolation

- In practice, Laplace often performs poorly for P(X|Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Get the unconditional P(X) from the data
  - Make sure the estimate of P(X|Y) isn't too different from P(X):
    P_LIN(x|y) = α P_ML(x|y) + (1 - α) P_ML(x)
  - What if α is 0? 1?
- For even better ways to estimate parameters, as well as details of the math, see cs281a and cs294-5

Real NB: Smoothing

- For real classification problems, smoothing is critical
- ... and usually done badly, even in big commercial systems
- New odds ratios:

  helvetica : 11.4        verdana : 28.8
  seems : 10.8            credit : 28.4
  group : 10.2            order : 27.2
  ago : 8.4               <font> : 26.9
  areas : 8.3             money : 26.5
  ...                     ...

- Do these make more sense?

Tuning on Held-Out Data

- Now we've got two kinds of unknowns:
  - Parameters: the probabilities P(X|Y), P(Y)
  - Hyper-parameters, like the amount of smoothing to do: k, α
- Where to learn?
  - Learn parameters from training data
  - Must tune hyper-parameters on different data. Why?
  - For each value of the hyper-parameters, train on the training data and test on the held-out data
  - Choose the best value and do a final test on the test data

Spam Example

Word      P(w|spam)  P(w|ham)   Tot Spam  Tot Ham
(prior)   0.33333    0.66666    -1.1      -0.4
Gary      0.00002    0.00021    -11.8     -8.9
would     0.00069    0.00084    -19.1     -16.0
you       0.00881    0.00304    -23.8     -21.8
like      0.00086    0.00083    -30.9     -28.9
to        0.01517    0.01339    -35.1     -33.2
lose      0.00008    0.00002    -44.5     -44.0
weight    0.00016    0.00002    -53.3     -55.0
while     0.00027    0.00027    -61.5     -63.2
you       0.00881    0.00304    -66.2     -69.0
sleep     0.00006    0.00001    -76.0     -80.5

(The "Tot" columns are running sums of log probabilities.)

P(spam | w) = 0.989

Confidences from a Classifier

- The confidence of a probabilistic classifier:
  - Posterior over the top label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee confidence is correct
- Calibration
  - Weak calibration: higher confidences mean higher accuracy
  - Strong calibration: confidence predicts accuracy rate
- What's the value of calibration?

Precision vs. Recall

- Let's say we want to classify web pages as homepages or not
  - In a test set of 1K pages, there are 3 homepages
  - Our classifier says they are all non-homepages
  - 99.7% accuracy!
  - Need new measures for rare positive events
- Precision: fraction of guessed positives which were actually positive
- Recall: fraction of actual positives which were guessed as positive
- Say we guess 5 homepages, of which 2 were actually homepages
  - Precision: 2 correct / 5 guessed = 0.4
  - Recall: 2 correct / 3 true = 0.67
- Which is more important in customer support email automation?
- Which is more important in airport face recognition?
[Figure: overlap between guessed positives and actual positives]

Precision vs. Recall

- Precision/recall tradeoff
  - Often, you can trade off precision and recall
  - Only works well with weakly calibrated classifiers
- To summarize the tradeoff:
  - Break-even point: the precision value when p = r
  - F-measure: harmonic mean of p and r:
    F1 = 2pr / (p + r)

Errors, and What to Do

- Examples of errors:

  "Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."

  ". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."

What to Do About Errors?

- Need more features – words aren't enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g. all features are word occurrences)

Features

- A feature is a function which signals a property of the input
- Examples:
  - ALL_CAPS: value is 1 iff email is in all caps
  - HAS_URL: value is 1 iff email has a URL
  - NUM_URLS: number of URLs in email
  - VERY_LONG: 1 iff email is longer than 1K
  - SUSPICIOUS_SENDER: 1 iff reply-to domain doesn't match originating server
- Features are anything you can think of code to evaluate on an ...
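The Laplace estimate from the smoothing slides can be sketched in a few lines. This is an illustrative sketch (the function name and arguments are my own), using the slide's coin example of observations H, H, T:

```python
from collections import Counter

def laplace(observations, outcomes, k=1):
    # P_LAP,k(x) = (count(x) + k) / (N + k * |X|):
    # pretend every outcome was seen k extra times.
    # k = 0 recovers the maximum-likelihood estimate.
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x] + k) / (n + k * len(outcomes))
            for x in outcomes}

# Slide's example: observations H, H, T
print(laplace(["H", "H", "T"], ["H", "T"]))        # k=1: H -> 3/5, T -> 2/5
print(laplace(["H", "H", "T"], ["H", "T"], k=0))   # ML:  H -> 2/3, T -> 1/3
```

Note how larger k pulls the estimate toward the uniform distribution, which is what "k is the strength of the prior" means.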
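Putting the pieces together — the naïve Bayes factorization, add-k smoothing of P(w|c), and the running log-probability totals from the spam example table — here is a minimal sketch; the function names and the toy training corpus are hypothetical, not from the lecture:

```python
import math
from collections import Counter

def train_nb(docs, labels, k=1.0):
    """Estimate P(c) and add-k smoothed P(w|c) from (doc, label) pairs:
    P(w|c) = (count(w, c) + k) / (count(c) + k * |V|)."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    def cond(w, c):
        total = sum(word_counts[c].values())
        return (word_counts[c][w] + k) / (total + k * len(vocab))
    return priors, cond

def classify(doc, priors, cond):
    # Running total log P(c) + sum_w log P(w|c), as in the spam table;
    # return the label with the highest total.
    scores = {c: math.log(p) + sum(math.log(cond(w, c)) for w in doc)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

# Hypothetical toy corpus:
docs = [["free", "click"], ["free", "money"],
        ["hello", "meeting"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
priors, cond = train_nb(docs, labels)
print(classify(["free", "money"], priors, cond))   # -> spam
```

The smoothing matters here: without it, any word unseen in a class would zero out that class's entire score.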
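Linear interpolation and held-out tuning of its weight α fit naturally together: train the ML estimates once, then score each candidate α on held-out data and keep the best. The probability tables below are made-up toy numbers, not values from the slides:

```python
import math

def p_lin(x, y, alpha, p_cond, p_marg):
    # P_LIN(x|y) = alpha * P_ML(x|y) + (1 - alpha) * P_ML(x)
    return alpha * p_cond.get((x, y), 0.0) + (1 - alpha) * p_marg.get(x, 0.0)

def heldout_loglik(pairs, alpha, p_cond, p_marg):
    # Score one candidate alpha by the log-likelihood it assigns to
    # held-out (x, y) pairs (assumes no zero probabilities).
    return sum(math.log(p_lin(x, y, alpha, p_cond, p_marg)) for x, y in pairs)

# Toy maximum-likelihood estimates from hypothetical training data:
p_cond = {("free", "spam"): 0.5, ("the", "spam"): 0.5}
p_marg = {"free": 0.1, "the": 0.9}
heldout = [("free", "spam"), ("the", "spam"), ("the", "spam")]

# Evaluate each candidate on the held-out data and keep the best:
best = max([0.0, 0.5, 1.0],
           key=lambda a: heldout_loglik(heldout, a, p_cond, p_marg))
```

On this toy data the interpolated estimate (α = 0.5) beats both extremes, which answers the slide's "What if α is 0? 1?": α = 1 is the unsmoothed conditional, α = 0 ignores y entirely.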
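The precision/recall definitions and the harmonic-mean F-measure can be computed directly on the homepage example (the sets of page ids below are hypothetical stand-ins):

```python
def precision_recall_f1(guessed, actual):
    # precision = true positives / guessed positives
    # recall    = true positives / actual positives
    # F1        = harmonic mean of p and r = 2pr / (p + r)
    tp = len(guessed & actual)
    p = tp / len(guessed)
    r = tp / len(actual)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Slide's example: guess 5 homepages, 2 of which are among the 3 true ones
p, r, f1 = precision_recall_f1(guessed={1, 2, 3, 4, 5}, actual={1, 2, 99})
# p = 2/5 = 0.4, r = 2/3 ≈ 0.67
```

The harmonic mean punishes imbalance: here F1 = 0.5, well below the arithmetic mean of p and r, which is why it is the standard single-number summary of the tradeoff.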