CS 6375 Machine Learning: Evaluation
Instructor: Yang Liu
(Slides adapted from Rich Caruana, Ray Mooney, and Tom Dietterich)

Today
- Performance measures: accuracy, ROC, precision/recall
- Comparing different classifiers

Performance Measure: Classification Accuracy (Binary)
- Target: 0/1 (true/false)
- Prediction f(x): 0/1 or a real value
- Thresholding: predict 1 if f(x) > threshold, else 0
- Accuracy = (# right) / (# total)

Confusion Matrix

            Predicted 1    Predicted 0
True 1          a              b
True 0          c              d

- a and d are correct; b and c are incorrect
- Accuracy = (a + d) / (a + b + c + d)

Terminology

            Predicted 1                          Predicted 0
True 1      true positives (TP): hits            false negatives (FN): misses
True 0      false positives (FP): false alarms   true negatives (TN): correct rejections

Problems with Accuracy
- Assumes equal cost for both kinds of errors
- Is 99% accuracy good? Is 10% bad? Either can be excellent, good, mediocre, poor, or terrible; it depends on the problem
- Base rate (chance performance): the accuracy obtained by always predicting the predominant class; for most problems, obtaining this is easy

Percent Reduction in Error
- 80% accuracy = 20% error rate
- If learning increases accuracy from 80% to 90%, error is reduced from 20% to 10%: a relative reduction of 50%

Costs: Adding Error Weights

            Predicted 1    Predicted 0
True 1         w_a            w_b
True 0         w_c            w_d

- Error weights can also be taken into account when building classifiers; the goal is then to minimize the weighted error rate

Receiver Operating Characteristic (ROC)
- Developed in WWII to statistically model false positive and false negative detections of radar operators
- A standard measure in medicine and biology
- Used a lot in ML too

ROC Plot
- Sweep the threshold (predict 1 if f(x) > threshold, else 0) and plot the true positive rate vs. the false positive rate, i.e., sensitivity vs. (1 - specificity)
- Sensitivity = a / (a + b) (the same as recall, discussed later)
- 1 - specificity = 1 - d / (c + d) = c / (c + d)
- Calculate the area under the curve (AUC): it represents performance averaged over all possible operating points
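The threshold sweep is easy to make concrete. Below is a minimal sketch (illustrative code, not from the slides): it assumes real-valued scores f(x) with 0/1 labels, computes a (false positive rate, true positive rate) point for each candidate threshold, and integrates AUC with the trapezoid rule.

```python
# Minimal ROC/AUC sketch: sweep a threshold over real-valued
# scores f(x), collect (FPR, TPR) points, and integrate the
# area under the curve with the trapezoid rule.

def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per candidate threshold."""
    pos = sum(labels)              # number of true 1s
    neg = len(labels) - pos        # number of true 0s
    pts = []
    # One threshold below every score (predict all 1s) plus one
    # at each distinct score value (strict '>' predicts 0 there).
    for t in [min(scores) - 1.0] + sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        pts.append((fp / neg, tp / pos))
    return sorted(pts)             # staircase from (0, 0) to (1, 1)

def auc(pts):
    """Trapezoid-rule area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical scores and labels, purely for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   1,    0,   0,   0]
print("AUC =", auc(roc_points(scores, labels)))  # 1.0 = perfect, 0.5 = chance
```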
Properties of ROC
- AUC = 1.0 is perfect prediction; AUC = 0.5 means something is wrong (no better than chance)
- The slope is non-increasing
- Each point on the ROC curve represents a different tradeoff between false positives and false negatives
- If two ROC curves do not intersect, one method dominates the other
- If two curves intersect, one is better in some regions and the other is better for other cost ratios

Precision and Recall
- Used in information retrieval and other detection tasks
- Recall: how many of the true positives does the model return?
- Precision: how many of the returned documents are correct?
- F-measure = 2 * precision * recall / (precision + recall), giving equal weight to precision and recall

Example
- A document collection has 1 million docs; for a given query, there are 1000 relevant docs
- The search engine returns 1500 docs; among them, 700 are correct
- Recall = ? Precision = ?

Precision and Recall (from the Confusion Matrix)

            Predicted 1    Predicted 0
True 1          a              b
True 0          c              d

- Recall = a / (a + b)
- Precision = a / (a + c)
- Precision-recall curve: sweep thresholds
- Break-even point: the threshold at which precision = recall

Precision-Recall Curve
[Figure: precision-recall curve obtained by sweeping the threshold]
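A quick check of the retrieval example above, computed directly from the slide's numbers (1000 relevant docs, 1500 returned, 700 of them correct):

```python
# Worked answer to the retrieval example above.
relevant, returned, correct = 1000, 1500, 700

recall = correct / relevant            # 700/1000 = 0.70
precision = correct / returned         # 700/1500 ~= 0.467
f_measure = 2 * precision * recall / (precision + recall)

print(f"recall    = {recall:.3f}")     # 0.700
print(f"precision = {precision:.3f}")  # 0.467
print(f"F-measure = {f_measure:.3f}")  # 0.560
```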
Summary of Performance Measures
- Accuracy may not be sufficient or appropriate
- Many other metrics exist
- Curves let you look at a range of operating points
- The measure you optimize makes a difference
- The measure you report makes a difference
- Use a measure appropriate for your problem and your community
- Not all of these generalize easily to more than 2 classes

Confidence Intervals

Evaluating Inductive Hypotheses
- The accuracy of a hypothesis on its training data is obviously biased, since the hypothesis was constructed to fit this data
- Accuracy must be evaluated on an independent (usually disjoint) test set
- The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets

Variance in Test Accuracy
- Let errorS(h) denote the percentage of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h
- Let errorD(h) denote the true error rate for the overall data distribution D
- When n is big, the central limit theorem ensures that the distribution of errorS(h) for different random samples will be closely approximated by a normal (Gaussian) distribution
[Figure: distribution P(errorS(h)) centered at errorD(h)]

Confidence Intervals
- When trying to measure the mean of a random variable, if there are a sufficient number of samples, and the samples are i.i.d. (drawn independently from the identical distribution), then the random variable can be represented by a Gaussian distribution with the sample mean and variance

Confidence Intervals (cont.)
- The true mean will fall in the interval (sample mean) ± z_N * sigma with N% confidence, where sigma is the standard deviation and z_N gives the width of the interval about the mean that includes N% of the total probability under the Gaussian
- z_N is drawn from a pre-calculated table
- Note that while the test sets are independent in n-way CV, the training sets are not, since they overlap; it is still a decent approximation

Confidence Intervals (computing)
- Calculate the error on a test set of size n: errorS(h)
- Compute a confidence interval on this estimate
- The standard error of this estimate is sqrt(errorS(h) * (1 - errorS(h)) / n)
- The confidence interval on the true error is errorS(h) ± z_N * sqrt(errorS(h) * (1 - errorS(h)) / n)
- For a 95% confidence interval, z_0.025 = 1.96

Example
- Your classifier's error rate on a test set with 1000 samples is 15%. What is the 95% confidence interval?
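Plugging the slide's numbers into the interval formula (a quick check, not part of the original slides):

```python
import math

# 95% confidence interval for the example above:
# error rate 0.15 measured on a test set of n = 1000 samples.
error, n, z = 0.15, 1000, 1.96

se = math.sqrt(error * (1 - error) / n)   # standard error
lo, hi = error - z * se, error + z * se

print(f"standard error = {se:.4f}")       # 0.0113
print(f"95% CI = [{lo:.3f}, {hi:.3f}]")   # roughly [0.128, 0.172]
```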
Significance Tests

Statistical Significance
- When can we say that one learning algorithm is better than another for a particular task, or type of tasks?
- Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
- For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior for that task?

Comparing Two Learned Hypotheses
- When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies
- Assume h1 is tested on test set S1 of size n1, and h2 is tested on test set S2 of size n2
- Depending on the samples drawn, we may observe h1 more accurate than h2, or h1 less accurate than h2, even when the underlying distributions are the same
[Figure: sampling distributions P(errorS(h)) of errorS1(h1) and errorS2(h2)]

Statistical Hypothesis Testing
- Determines the probability of the null hypothesis that the two samples were actually drawn from the same underlying distribution
- We answer a question such as: if the null hypothesis were true, how unlikely would it be to obtain the observed data?
- By scientific convention, we reject the null hypothesis and say the difference is statistically significant when that probability is small (typically below 0.05)
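To make the comparison concrete, here is a minimal sketch (an illustration under assumed inputs, not code from the slides): a two-sided z-test for the difference of two error rates measured on independent test sets, using the same Gaussian approximation as the confidence intervals above. The numbers reuse the slide's 95% vs. 93% accuracy example, with hypothetical test sets of 1000 examples each.

```python
import math

def z_test_two_errors(e1, n1, e2, n2):
    """Two-sided z-test for the difference of two error rates
    measured on independent test sets (Gaussian approximation)."""
    d = e1 - e2
    # The variance of the difference is the sum of the two variances.
    se = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = d / se
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Error rates 5% and 7% (i.e., 95% vs. 93% accuracy), n1 = n2 = 1000.
z, p = z_test_two_errors(0.05, 1000, 0.07, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # reject the null hypothesis if p < 0.05
```

With these particular numbers the p-value comes out just above 0.05, so a 2-point accuracy gap on 1000-example test sets is not quite significant at the conventional level.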