CS 188: Artificial Intelligence
Fall 2007
Lecture 24: Perceptrons
11/20/2007
Dan Klein – UC Berkeley

Outline: General Naïve Bayes; Example: Spam Filtering; Example: OCR; Example: Overfitting; Generalization and Overfitting; Estimation: Smoothing; Estimation: Laplace Smoothing; Estimation: Linear Interpolation; Real NB: Smoothing; Tuning on Held-Out Data; Baselines; Confidences from a Classifier; Generative vs. Discriminative; Errors, and What to Do; What to Do About Errors?; Features; Feature Extractors; Some (Vague) Biology; The Binary Perceptron; Example: Spam; Binary Decision Rule; The Multiclass Perceptron; Example; The Perceptron Update Rule; Examples: Perceptron; Mistake-Driven Classification; Properties of Perceptrons; Issues with Perceptrons; Linear Separators; Support Vector Machines; Summary; Similarity Functions; Case-Based Reasoning; Parametric / Non-parametric; Nearest-Neighbor Classification; Basic Similarity; Invariant Metrics; Rotation Invariant Metrics; Tangent Families; Template Deformation

General Naïve Bayes

A general naive Bayes model:
- We only specify how each feature depends on the class
- Total number of parameters is linear in n

[Figure: graphical model with class node C and feature nodes E1, E2, ..., En]
- P(C): |C| parameters
- P(Ei|C), one table per feature: n x |E| x |C| parameters
- (compare: a full joint table over C, E1, ..., En would need |C| x |E|^n parameters)

Example: Spam Filtering

Model: P(C, W1, ..., Wn) = P(C) ∏i P(Wi|C)

Parameters:

  P(C):       ham: 0.66, spam: 0.33

  P(W|spam):  the: 0.016, to: 0.015, and: 0.012, ...
              free: 0.001, click: 0.001, ...
              morally: 0.001, nicely: 0.001, ...

  P(W|ham):   the: 0.021, to: 0.013, and: 0.011, ...
              free: 0.005, click: 0.004, ...
              screens: 0.000, minute: 0.000, ...

Example: OCR

  digit   P(Y)   P(pixel 1 on | Y)   P(pixel 2 on | Y)
    1     0.1         0.01                0.05
    2     0.1         0.05                0.01
    3     0.1         0.05                0.90
    4     0.1         0.30                0.80
    5     0.1         0.80                0.90
    6     0.1         0.90                0.90
    7     0.1         0.05                0.25
    8     0.1         0.60                0.85
    9     0.1         0.50                0.60
    0     0.1         0.80                0.80

Example: Overfitting

[Figure: posteriors for a test digit image under relative-frequency estimates]
2 wins!!

Generalization and Overfitting

- Relative-frequency parameters will overfit the training data!
  - Unlikely that every occurrence of "minute" is 100% spam
  - Unlikely that every occurrence of "seriously" is 100% ham
  - What about all the words that don't occur in the training set?
  - In general, we can't go around giving unseen events zero probability
- As an extreme case, imagine using the entire email as the only feature
  - Would get the training data perfect (if deterministic labeling)
  - Wouldn't generalize at all
- Just making the bag-of-words assumption gives us some generalization, but it isn't enough
- To generalize better: we need to smooth or regularize the estimates

Estimation: Smoothing

- Problems with maximum likelihood estimates:
  - If I flip a coin once, and it's heads, what's the estimate for P(heads)?
  - What if I flip 10 times with 8 heads?
  - What if I flip 10M times with 8M heads?
- Basic idea:
  - We have some prior expectation about parameters (here, the probability of heads)
  - Given little evidence, we should skew towards our prior
  - Given a lot of evidence, we should listen to the data

Estimation: Smoothing (continued)

- Relative frequencies are the maximum likelihood estimates:
    θ_ML = argmax_θ P(X | θ),  which gives  P_ML(x) = c(x) / N
  (here c(x) is the count of x and N is the total number of observations)
- In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution:
    θ_MAP = argmax_θ P(θ | X)

Estimation: Laplace Smoothing

- Laplace's estimate: pretend you saw every outcome once more than you actually did:
    P_LAP(x) = (c(x) + 1) / (N + |X|)
- Example (observations H H T): P_ML(H) = 2/3, but P_LAP(H) = 3/5
- Can derive this as a MAP estimate with Dirichlet priors (see cs281a)

Estimation: Laplace Smoothing (continued)

- Laplace's estimate (extended): pretend you saw every outcome k extra times:
    P_LAP,k(x) = (c(x) + k) / (N + k|X|)
  - What's Laplace with k = 0? (the maximum likelihood estimate)
  - k is the strength of the prior
- Laplace for conditionals: smooth each condition independently:
    P_LAP,k(x|y) = (c(x, y) + k) / (c(y) + k|X|)

Estimation: Linear Interpolation

- In practice, Laplace often performs poorly for P(X|Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Also get P(X) from the data
  - Make sure the estimate of P(X|Y) isn't too different from P(X):
      P_LIN(x|y) = α · P_ML(x|y) + (1 − α) · P_ML(x)
  - What if α is 0?
- What if α is 1?
- For even better ways to estimate parameters, as well as details of the math, see cs281a and cs294

Real NB: Smoothing

- For real classification problems, smoothing is critical
- New odds ratios:

  ham-indicative:   helvetica: 11.4, seems: 10.8, group: 10.2, ago: 8.4, areas: 8.3, ...
  spam-indicative:  verdana: 28.8, Credit: 28.4, ORDER: 27.2, <FONT>: 26.9, money: 26.5, ...

- Do these make more sense?

Tuning on Held-Out Data

- Now we've got two kinds of unknowns:
  - Parameters: the probabilities P(X|Y), P(Y)
  - Hyperparameters, like the amount of smoothing to do: k, α
- Where to learn?
  - Learn parameters from training data
  - Must tune hyperparameters on different data (why?)
  - For each value of the hyperparameters, train and test on the held-out data
  - Choose the best value and do a final test on the test data

Baselines

- First task: get a baseline
  - Baselines are very simple "straw man" procedures
  - Help determine how hard the task is
  - Help know what a "good" accuracy is
- Weak baseline: most-frequent-label classifier
  - Gives all test instances whatever label was most common in the training set
  - E.g. for spam filtering, might label everything as ham
  - Accuracy might be very high if the problem is skewed
- For real research, usually use previous work as a (strong) baseline

Confidences from a Classifier

- The confidence of a probabilistic classifier:
  - Posterior probability of the top label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee the confidence is correct
- Calibration
  - Weak calibration: higher confidences mean higher accuracy
  - Strong calibration: confidence predicts accuracy rate
  - What's the value of calibration?

Generative vs. Discriminative

- Generative classifiers:
  - E.g. naïve Bayes
  - We build a causal model of the variables
  - We then query that model for causes, given evidence
- Discriminative classifiers:
  - E.g. perceptron (next)
  - No causal model, no Bayes rule, often no probabilities
  - Try to predict the output directly
  - Loosely: mistake-driven rather than model-driven

Errors, and What to Do

Examples of errors:

  "Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."

  ". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd"
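The Laplace smoothing described above can be sketched in a few lines of Python. This is an illustrative sketch, not course code; the function name is mine:

```python
from collections import Counter

def laplace_estimate(observations, outcomes, k=1):
    """Laplace smoothing: pretend every outcome was seen k extra times.
    P_LAP,k(x) = (c(x) + k) / (N + k * |X|)."""
    counts = Counter(observations)
    n = len(observations)
    denom = n + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

# The coin example from the slides: observations H H T.
flips = ["H", "H", "T"]
p_ml = laplace_estimate(flips, ["H", "T"], k=0)   # k = 0 recovers max likelihood
p_lap = laplace_estimate(flips, ["H", "T"], k=1)
print(p_ml["H"])   # 2/3
print(p_lap["H"])  # (2 + 1) / (3 + 2) = 0.6
```

Note that with k = 0 the denominator collapses to N and the estimate is exactly the relative frequency, matching the "strength of the prior" reading of k.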
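Linear interpolation is likewise a one-liner; below is a minimal sketch with hypothetical relative-frequency inputs (the word probabilities are made up for illustration):

```python
def interpolate(p_ml_cond, p_ml_marginal, alpha):
    """P_LIN(x|y) = alpha * P_ML(x|y) + (1 - alpha) * P_ML(x).
    alpha = 1 keeps the conditional; alpha = 0 backs off fully to P(x)."""
    return {x: alpha * p_ml_cond[x] + (1 - alpha) * p_ml_marginal[x]
            for x in p_ml_marginal}

# Hypothetical estimates: "free" was never seen in ham, so P_ML(free|ham) = 0.
p_word_given_ham = {"the": 0.7, "free": 0.0, "click": 0.3}
p_word = {"the": 0.6, "free": 0.2, "click": 0.2}
smoothed = interpolate(p_word_given_ham, p_word, alpha=0.5)
print(smoothed["free"])  # 0.5 * 0.0 + 0.5 * 0.2 = 0.1, no longer zero
```

Because both inputs are proper distributions, any convex combination of them still sums to one, so the interpolated estimate remains a valid distribution.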
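The held-out tuning recipe from the lecture can be sketched as a simple grid search. Everything here is a stand-in: `train_fn`, `accuracy_fn`, and the score table are hypothetical, not part of the course materials:

```python
def tune_k(train_fn, accuracy_fn, train_data, held_out_data, candidates):
    """For each hyperparameter value, train on the training data and score on
    held-out data; return the best value. A final number should then be
    reported once on a separate, untouched test set."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        model = train_fn(train_data, k)
        acc = accuracy_fn(model, held_out_data)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# Toy stand-ins: the "model" is just the smoothing value itself, and the
# held-out accuracies come from a fabricated score table that peaks at k = 1.
scores = {0: 0.80, 1: 0.93, 10: 0.88, 100: 0.75}
best = tune_k(lambda data, k: k, lambda m, d: scores[m],
              None, None, [0, 1, 10, 100])
print(best)  # 1
```

The design point the slides make is visible in the signature: the data used to pick k is disjoint from both the training set (which fits the parameters) and the test set (which is touched only once, at the end).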