CS6375 Machine Learning
Evaluation
Instructor: Yang Liu
Slides adapted from Rich Caruana, Ray Mooney, Tom Dietterich

Today
- Performance measures
  - Accuracy
  - ROC
  - Precision/recall
- Comparing different classifiers

Performance Measure: Classification Accuracy (Binary)
- Target: 0/1, -1/+1, true/false, ...
- Prediction = f(x): either 0/1 or a real value
- Threshold: f(x) > threshold => 1; else => 0
- Accuracy = #right / #total

Confusion Matrix
              Predicted 1     Predicted 0
  True 1      a (correct)     b (incorrect)
  True 0      c (incorrect)   d (correct)
- Accuracy = (a + d) / (a + b + c + d)
- (A code sketch of this computation follows the summary slide at the end of this part)

Terminology
              Predicted 1             Predicted 0
  True 1      True positives (TP)     False negatives (FN)
              "hits"                  "misses"
  True 0      False positives (FP)    True negatives (TN)
              "false alarms"          "correct rejections"

Problems with Accuracy
- Assumes equal cost for both kinds of errors
- Is 99% accuracy good? Is 10% bad?
  - Either can be excellent, good, mediocre, poor, or terrible; it depends on the problem
- Base rate (or chance performance): the accuracy obtained by always predicting the predominant class (for most problems, this is easy to achieve)

Percent Reduction in Error
- 80% accuracy = 20% error rate
- If learning increases accuracy from 80% to 90%, error is reduced from 20% to 10%
- The relative reduction in error is 50%

Costs: Adding Error Weights
              Predicted 1     Predicted 0
  True 1      w_a             w_b
  True 0      w_c             w_d
- Error weights can also be taken into account when building classifiers; the goal is then to minimize the weighted error rate

Receiver Operating Characteristic (ROC)
- Developed in WWII to statistically model false positive and false negative detections by radar operators
- A standard measure in medicine and biology
- Used a lot in ML too

ROC Plot
- Sweep the threshold (f(x) > threshold => 1; else => 0) and plot:
  - True positive rate vs. false positive rate
  - Equivalently: sensitivity vs. 1 - specificity
  - Sensitivity = a / (a + b) = recall (discussed later)
  - 1 - specificity = 1 - d / (c + d) = c / (c + d)
- Calculate the area under the curve (AUC)
  - Represents performance averaged over all possible operating points
- (A sweep-and-integrate sketch appears after the summary slide below)

[Figure: example ROC curve]

Properties of ROC
- ROC area (AUC):
  - 1.0: perfect prediction
  - < 0.5: something is wrong
- The slope of the curve is non-increasing
- Each point on the ROC curve represents a different tradeoff between false positives and false negatives
- If two ROC curves do not intersect, one method dominates the other
- If two curves intersect, one is better for some cost ratios and the other is better for the rest

Precision and Recall
- Used in information retrieval (and other detection tasks)
- Recall: how many of the true positives does the model return?
- Precision: how many of the returned documents are correct?
- F-measure (equal weight): F = 2 * precision * recall / (precision + recall)

Example
- A document collection has 1 million docs
- For a given query, there are 1000 relevant docs
- The search engine returns 1500 docs; among them, 700 are correct
- Recall = ? Precision = ?
  (Answer: recall = 700/1000 = 0.70; precision = 700/1500 ≈ 0.47; worked code follows the summary slide below)

Precision and Recall
              Predicted 1     Predicted 0
  True 1      a               b
  True 0      c               d
- Recall = a / (a + b)
- Precision = a / (a + c)
- Precision/recall curve: sweep thresholds
- Break-even point: the threshold at which precision = recall

[Figure: example precision-recall curve]

Summary of Performance Measures
- Accuracy may not be sufficient or appropriate
- Many other metrics exist
- Curves let you look at a whole range of operating points
- The measure you optimize for makes a difference
- The measure you report makes a difference
- Use a measure appropriate for your problem and its community
- Not all of these generalize easily to more than 2 classes
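A minimal plain-Python sketch of the thresholding, confusion-matrix, and accuracy computations above; the labels, scores, and threshold are made-up illustrative values, not from the slides:

    def confusion_matrix(labels, scores, threshold=0.5):
        """Count (tp, fn, fp, tn): predict 1 when f(x) > threshold."""
        tp = fn = fp = tn = 0
        for y, s in zip(labels, scores):
            pred = 1 if s > threshold else 0
            if y == 1 and pred == 1:
                tp += 1          # a: hits
            elif y == 1:
                fn += 1          # b: misses
            elif pred == 1:
                fp += 1          # c: false alarms
            else:
                tn += 1          # d: correct rejections
        return tp, fn, fp, tn

    labels = [1, 1, 0, 0, 1, 0]                  # true classes (illustrative)
    scores = [0.9, 0.4, 0.6, 0.1, 0.8, 0.3]      # classifier outputs f(x)
    tp, fn, fp, tn = confusion_matrix(labels, scores)
    print((tp + tn) / (tp + fn + fp + tn))       # accuracy = (a + d) / (a + b + c + d)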
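In the same spirit, a sketch of the ROC sweep and AUC calculation: sweep the threshold over the observed scores, record (false positive rate, true positive rate) at each setting, and integrate with the trapezoid rule. The data are the same illustrative values as above.

    def roc_points(labels, scores):
        """Sweep thresholds high to low; at each, record (FPR, TPR) =
        (c / (c + d), a / (a + b)). The curve runs from (0, 0) to (1, 1)."""
        pos = sum(1 for y in labels if y == 1)
        neg = len(labels) - pos
        points = [(0.0, 0.0)]
        for t in sorted(set(scores), reverse=True):
            tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
            fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        """Trapezoid-rule area under the (FPR, TPR) curve."""
        return sum((x1 - x0) * (y0 + y1) / 2
                   for (x0, y0), (x1, y1) in zip(points, points[1:]))

    points = roc_points([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.6, 0.1, 0.8, 0.3])
    print(auc(points))    # about 0.89 on this toy data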
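And a worked version of the retrieval example (1000 relevant docs exist, 1500 docs returned, 700 of them correct), using the formulas from the precision/recall slide:

    tp = 700           # a: relevant docs that were returned
    fn = 1000 - tp     # b: relevant docs that were missed
    fp = 1500 - tp     # c: returned docs that were not relevant

    recall = tp / (tp + fn)        # a / (a + b) = 0.70
    precision = tp / (tp + fp)     # a / (a + c) ≈ 0.467
    f_measure = 2 * precision * recall / (precision + recall)
    print(recall, precision, f_measure)   # 0.7  0.466...  0.56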
Confidence Intervals

Evaluating Inductive Hypotheses
- Accuracy measured on the training data is obviously biased, since the hypothesis was constructed to fit that data
- Accuracy must be evaluated on an independent (usually disjoint) test set
- The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets

Variance in Test Accuracy/Error
- Let error_S(h) denote the percentage of examples in an independently sampled test set S of size n that are misclassified by hypothesis h
- Let error_D(h) denote the true error rate for the overall data distribution D
- When n is large, the central limit theorem ensures that the distribution of error_S(h) across different random samples is closely approximated by a normal (Gaussian) distribution

[Figure: distribution P(error_S(h)) as a function of error_S(h), centered at error_D(h)]

Confidence Intervals
- When trying to estimate the mean of a random variable, if:
  - there is a sufficient number of samples, and
  - the samples are i.i.d. (drawn independently from an identical distribution),
  then the random variable can be modeled by a Gaussian distribution with the sample mean and variance

Confidence Intervals
- The true mean falls in the interval ± z_N * sigma around the sample mean with N% confidence, where sigma is the standard deviation and z_N gives the width of the interval about the mean that includes N% of the total probability under the Gaussian
- z_N is drawn from a pre-calculated table
- Note that while the test sets are independent in n-way cross-validation, the training sets are not, since they overlap (still a decent approximation)

Confidence Intervals
- Calculate the error on a test set of size n: error_S(h)
- Compute a confidence interval on this estimate
- The standard error of this estimate is
    sqrt( error_S(h) * (1 - error_S(h)) / n )
- The confidence interval on the true error is
    error_S(h) ± z_N * sqrt( error_S(h) * (1 - error_S(h)) / n )
- For a 95% confidence interval, z_0.025 = 1.96

Example
- Your classifier's error rate on a test set with 1000 samples is 15%
- 95% confidence interval:
    0.15 ± 1.96 * sqrt( 0.15 * 0.85 / 1000 ) ≈ 0.15 ± 0.022, i.e. roughly [12.8%, 17.2%]
  (a code sketch of this calculation follows at the end of this part)

Significance Test

Statistical Significance
- When can we say that one learning algorithm is better than another for a particular task or type of task?
- Is a particular hypothesis really better than another because its accuracy is higher on a validation set?
- For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior for that task?

Comparing Two Learned Hypotheses
- When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies
- Assume h1 is tested on test set S1 of size n1
- Assume h2 is tested on test set S2 of size n2

[Figure, shown on two consecutive slides: sampling distributions of error_S1(h1) and error_S2(h2); depending on the random draw of test sets, we may observe h1 more accurate than h2, or the reverse]
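A small sketch of the confidence-interval arithmetic, using the slide's example numbers (15% error, 1000 test samples, z = 1.96):

    import math

    def error_confidence_interval(error, n, z=1.96):
        """Normal-approximation interval for the true error, given the
        measured test-set error on n samples (z = 1.96 gives 95%)."""
        se = math.sqrt(error * (1 - error) / n)   # standard error
        return error - z * se, error + z * se

    print(error_confidence_interval(0.15, 1000))  # about (0.128, 0.172)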
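The extracted slides cut off before giving a procedure for the comparison, so the following is a sketch only: one standard choice, in the spirit of the confidence intervals above, is a two-sample z-test on the difference of the two error rates. The test-set sizes here are assumed values for illustration, not from the slides.

    import math

    def z_for_error_difference(e1, n1, e2, n2):
        """Approximate z-statistic for the difference between two error
        rates measured on independent test sets (normal approximation)."""
        se = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
        return (e1 - e2) / se

    # The slides' example: 93% vs. 95% accuracy, assuming (hypothetically)
    # each was measured on its own 1000-example test set
    z = z_for_error_difference(0.07, 1000, 0.05, 1000)
    print(z)   # about 1.88 < 1.96, so not significant at the 95% level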