CS6375 Machine Learning
Evaluation
Instructor: Yang Liu
Slides adapted from Rich Caruana, Ray Mooney, Tom Dietterich

Today
- Performance measures
  - Accuracy
  - ROC
  - Precision/recall
- Comparing different classifiers

Performance Measure: Classification Accuracy (Binary)
- Target: 0/1, -1/+1, true/false, ...
- Prediction = f(x): either 0/1 or a real value
- Threshold: f(x) > threshold => 1; else => 0
- Accuracy = #right / #total

Confusion Matrix
              Predicted 1     Predicted 0
  True 1      a (correct)     b (incorrect)
  True 0      c (incorrect)   d (correct)
- Accuracy = (a + d) / (a + b + c + d)
- (A code sketch of this computation follows the summary slide at the end of this part)

Terminology
              Predicted 1             Predicted 0
  True 1      True positives (TP)     False negatives (FN)
              "hits"                  "misses"
  True 0      False positives (FP)    True negatives (TN)
              "false alarms"          "correct rejections"

Problems with Accuracy
- Assumes equal cost for both kinds of errors
- Is 99% accuracy good? Is 10% bad?
  - Either can be excellent, good, mediocre, poor, or terrible; it depends on the problem
- Base rate (or chance performance): the accuracy obtained by always predicting the predominant class (for most problems, this is easy to achieve)

Percent Reduction in Error
- 80% accuracy = 20% error rate
- If learning increases accuracy from 80% to 90%, error is reduced from 20% to 10%
- The relative reduction in error is 50%

Costs: Adding Error Weights
              Predicted 1     Predicted 0
  True 1      w_a             w_b
  True 0      w_c             w_d
- Error weights can also be taken into account when building classifiers; the goal is then to minimize the weighted error rate

Receiver Operating Characteristic (ROC)
- Developed in WWII to statistically model false positive and false negative detections by radar operators
- A standard measure in medicine and biology
- Used a lot in ML too

ROC Plot
- Sweep the threshold (f(x) > threshold => 1; else => 0) and plot:
  - True positive rate vs. false positive rate
  - Equivalently: sensitivity vs. 1 - specificity
  - Sensitivity = a / (a + b) = recall (discussed later)
  - 1 - specificity = 1 - d / (c + d) = c / (c + d)
- Calculate the area under the curve (AUC)
  - Represents performance averaged over all possible operating points
- (A sweep-and-integrate sketch appears after the summary slide below)

[Figure: example ROC curve]

Properties of ROC
- ROC area (AUC):
  - 1.0: perfect prediction
  - < 0.5: something is wrong
- The slope of the curve is non-increasing
- Each point on the ROC curve represents a different tradeoff between false positives and false negatives
- If two ROC curves do not intersect, one method dominates the other
- If two curves intersect, one is better for some cost ratios and the other is better for the rest

Precision and Recall
- Used in information retrieval (and other detection tasks)
- Recall: how many of the true positives does the model return?
- Precision: how many of the returned documents are correct?
- F-measure (equal weight): F = 2 * precision * recall / (precision + recall)

Example
- A document collection has 1 million docs
- For a given query, there are 1000 relevant docs
- The search engine returns 1500 docs; among them, 700 are correct
- Recall = ? Precision = ?
  (Answer: recall = 700/1000 = 0.70; precision = 700/1500 ≈ 0.47; worked code follows the summary slide below)

Precision and Recall
              Predicted 1     Predicted 0
  True 1      a               b
  True 0      c               d
- Recall = a / (a + b)
- Precision = a / (a + c)
- Precision/recall curve: sweep thresholds
- Break-even point: the threshold at which precision = recall

[Figure: example precision-recall curve]

Summary of Performance Measures
- Accuracy may not be sufficient or appropriate
- Many other metrics exist
- Curves let you look at a whole range of operating points
- The measure you optimize for makes a difference
- The measure you report makes a difference
- Use a measure appropriate for your problem and its community
- Not all of these generalize easily to more than 2 classes
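A minimal plain-Python sketch of the thresholding, confusion-matrix, and accuracy computations above; the labels, scores, and threshold are made-up illustrative values, not from the slides:

    def confusion_matrix(labels, scores, threshold=0.5):
        """Count (tp, fn, fp, tn): predict 1 when f(x) > threshold."""
        tp = fn = fp = tn = 0
        for y, s in zip(labels, scores):
            pred = 1 if s > threshold else 0
            if y == 1 and pred == 1:
                tp += 1          # a: hits
            elif y == 1:
                fn += 1          # b: misses
            elif pred == 1:
                fp += 1          # c: false alarms
            else:
                tn += 1          # d: correct rejections
        return tp, fn, fp, tn

    labels = [1, 1, 0, 0, 1, 0]                  # true classes (illustrative)
    scores = [0.9, 0.4, 0.6, 0.1, 0.8, 0.3]      # classifier outputs f(x)
    tp, fn, fp, tn = confusion_matrix(labels, scores)
    print((tp + tn) / (tp + fn + fp + tn))       # accuracy = (a + d) / (a + b + c + d)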
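In the same spirit, a sketch of the ROC sweep and AUC calculation: sweep the threshold over the observed scores, record (false positive rate, true positive rate) at each setting, and integrate with the trapezoid rule. The data are the same illustrative values as above.

    def roc_points(labels, scores):
        """Sweep thresholds high to low; at each, record (FPR, TPR) =
        (c / (c + d), a / (a + b)). The curve runs from (0, 0) to (1, 1)."""
        pos = sum(1 for y in labels if y == 1)
        neg = len(labels) - pos
        points = [(0.0, 0.0)]
        for t in sorted(set(scores), reverse=True):
            tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
            fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        """Trapezoid-rule area under the (FPR, TPR) curve."""
        return sum((x1 - x0) * (y0 + y1) / 2
                   for (x0, y0), (x1, y1) in zip(points, points[1:]))

    points = roc_points([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.6, 0.1, 0.8, 0.3])
    print(auc(points))    # about 0.89 on this toy data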
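And a worked version of the retrieval example (1000 relevant docs exist, 1500 docs returned, 700 of them correct), using the formulas from the precision/recall slide:

    tp = 700           # a: relevant docs that were returned
    fn = 1000 - tp     # b: relevant docs that were missed
    fp = 1500 - tp     # c: returned docs that were not relevant

    recall = tp / (tp + fn)        # a / (a + b) = 0.70
    precision = tp / (tp + fp)     # a / (a + c) ≈ 0.467
    f_measure = 2 * precision * recall / (precision + recall)
    print(recall, precision, f_measure)   # 0.7  0.466...  0.56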
Confidence Intervals

Evaluating Inductive Hypotheses
- Accuracy measured on the training data is obviously biased, since the hypothesis was constructed to fit that data
- Accuracy must be evaluated on an independent (usually disjoint) test set
- The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets

Variance in Test Accuracy/Error
- Let error_S(h) denote the percentage of examples in an independently sampled test set S of size n that are misclassified by hypothesis h
- Let error_D(h) denote the true error rate for the overall data distribution D
- When n is large, the central limit theorem ensures that the distribution of error_S(h) across different random samples is closely approximated by a normal (Gaussian) distribution

[Figure: distribution P(error_S(h)) as a function of error_S(h), centered at error_D(h)]

Confidence Intervals
- When trying to estimate the mean of a random variable, if:
  - there is a sufficient number of samples, and
  - the samples are i.i.d. (drawn independently from an identical distribution),
  then the random variable can be modeled by a Gaussian distribution with the sample mean and variance

Confidence Intervals
- The true mean falls in the interval ± z_N * sigma around the sample mean with N% confidence, where sigma is the standard deviation and z_N gives the width of the interval about the mean that includes N% of the total probability under the Gaussian
- z_N is drawn from a pre-calculated table
- Note that while the test sets are independent in n-way cross-validation, the training sets are not, since they overlap (still a decent approximation)

Confidence Intervals
- Calculate the error on a test set of size n: error_S(h)
- Compute a confidence interval on this estimate
- The standard error of this estimate is
    sqrt( error_S(h) * (1 - error_S(h)) / n )
- The confidence interval on the true error is
    error_S(h) ± z_N * sqrt( error_S(h) * (1 - error_S(h)) / n )
- For a 95% confidence interval, z_0.025 = 1.96

Example
- Your classifier's error rate on a test set with 1000 samples is 15%
- 95% confidence interval:
    0.15 ± 1.96 * sqrt( 0.15 * 0.85 / 1000 ) ≈ 0.15 ± 0.022, i.e. roughly [12.8%, 17.2%]
  (a code sketch of this calculation follows at the end of this part)

Significance Test

Statistical Significance
- When can we say that one learning algorithm is better than another for a particular task or type of task?
- Is a particular hypothesis really better than another because its accuracy is higher on a validation set?
- For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior for that task?

Comparing Two Learned Hypotheses
- When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies
- Assume h1 is tested on test set S1 of size n1
- Assume h2 is tested on test set S2 of size n2

[Figure, shown on two consecutive slides: sampling distributions of error_S1(h1) and error_S2(h2); depending on the random draw of test sets, we may observe h1 more accurate than h2, or the reverse]
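A small sketch of the confidence-interval arithmetic, using the slide's example numbers (15% error, 1000 test samples, z = 1.96):

    import math

    def error_confidence_interval(error, n, z=1.96):
        """Normal-approximation interval for the true error, given the
        measured test-set error on n samples (z = 1.96 gives 95%)."""
        se = math.sqrt(error * (1 - error) / n)   # standard error
        return error - z * se, error + z * se

    print(error_confidence_interval(0.15, 1000))  # about (0.128, 0.172)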
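The extracted slides cut off before giving a procedure for the comparison, so the following is a sketch only: one standard choice, in the spirit of the confidence intervals above, is a two-sample z-test on the difference of the two error rates. The test-set sizes here are assumed values for illustration, not from the slides.

    import math

    def z_for_error_difference(e1, n1, e2, n2):
        """Approximate z-statistic for the difference between two error
        rates measured on independent test sets (normal approximation)."""
        se = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
        return (e1 - e2) / se

    # The slides' example: 93% vs. 95% accuracy, assuming (hypothetically)
    # each was measured on its own 1000-example test set
    z = z_for_error_difference(0.07, 1000, 0.05, 1000)
    print(z)   # about 1.88 < 1.96, so not significant at the 95% level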