UT Dallas CS 6375 - 11.evaluation

Slide 1: CS 6375 Machine Learning - Evaluation
Instructor: Yang Liu
Slides adapted from Rich Caruana, Ray Mooney, Tom Dietterich

Slide 2: Today
- Performance measures: accuracy, ROC, precision/recall
- Comparing different classifiers

Slide 3: Performance Measure: Classification Accuracy (binary)
- Target: 0/1, -1/+1, true/false, ...
- Prediction = f(x): either 0/1 or a real value
- Threshold: f(x) > threshold => 1; else => 0
- Accuracy = #right / #total

Slide 4: Confusion Matrix

              Predicted 1   Predicted 0
  True 1      a             b
  True 0      c             d

- Accuracy = (a + d) / (a + b + c + d)
- a and d are correct predictions; b and c are incorrect

Slide 5: Terminology

              Predicted 1                          Predicted 0
  True 1      True positive (TP): hits             False negative (FN): misses
  True 0      False positive (FP): false alarms    True negative (TN): correct rejections

Slide 6: Problems with Accuracy
- Assumes equal cost for both kinds of errors
- Is 99% accuracy good? Is 10% bad? Either can be excellent, good, mediocre, poor, or terrible: it depends on the problem
- Base rate (or chance performance): the accuracy of predicting the predominant class (for most problems, obtaining this is easy)

Slide 7: Percent Reduction in Error
- 80% accuracy = 20% error rate
- Learning increases accuracy from 80% to 90%, so error is reduced from 20% to 10%
- The relative reduction in error is 50%

Slide 8: Costs - Adding Error Weights

              Predicted 1   Predicted 0
  True 1      w_a           w_b
  True 0      w_c           w_d

- Error weights can also be taken into account when building classifiers; the goal is then to minimize the weighted error rate

Slide 9: Receiver Operating Characteristic (ROC)
- Developed in WWII to statistically model false positive and false negative detections by radar operators
- A standard measure in medicine and biology
- Used a lot in ML too

Slide 10: ROC Plot
- Sweep the threshold (f(x) > threshold => 1; else => 0) and plot true positive rate vs. false positive rate, i.e. sensitivity vs. 1 - specificity
- Sensitivity = a / (a + b) = recall (discussed later)
- 1 - specificity = 1 - d / (c + d) = c / (c + d)
- Calculate the area under the curve (AUC): it represents performance averaged over all possible operating points

Slides 11-12: Properties of ROC
- ROC area: 1.0 means perfect prediction; < 0.5 means something is wrong
- The slope is non-increasing
- Each point on the ROC curve represents a different tradeoff between false positives and false negatives
- If two ROC curves do not intersect, one method dominates the other
- If two curves intersect, one is better in some region and the other is better for other cost ratios

Slide 13: Precision and Recall
- Used in information retrieval (and other detection tasks)
- Recall: how many of the true positives does the model return?
- Precision: how many of the returned documents are correct?
- F-measure (equal weight) = 2 * precision * recall / (precision + recall)

Slide 14: Example
- A document collection has 1 million docs
- For a given query, there are 1,000 relevant docs
- The search engine returns 1,500 docs; among them, 700 are correct
- Recall = ? Precision = ? (see the worked sketch after the summary slide)

Slide 15: Precision and Recall
- Recall = a / (a + b), Precision = a / (a + c), using the confusion-matrix entries from slide 4
- Precision/recall curve: sweep thresholds
- Break-even point: precision = recall

Slide 16: Precision-Recall Curve
(figure: example precision-recall curve)

Slide 17: Summary of Performance Measures
- Accuracy may not be sufficient or appropriate; many other metrics exist
- Curves allow you to look at a range of operating points
- The measure you optimize for makes a difference, and the measure you report makes a difference
- Use a measure appropriate for your problem and the community
- Not all of these generalize easily to more than 2 classes
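To make the counting metrics above concrete, here is a minimal sketch in plain Python (illustrative only, not code from the slides; the function names are mine). It builds the a/b/c/d counts from slide 4 by thresholding scores, then applies the precision, recall, and F-measure formulas to the retrieval example from slide 14 (1,000 relevant docs, 1,500 returned, 700 of them correct).

```python
# Illustrative sketch (not from the slides): the a/b/c/d confusion-matrix counts
# from slide 4 and the metrics defined on slides 3-15.

def confusion_counts(scores, labels, threshold=0.5):
    """Return (a, b, c, d) = (TP, FN, FP, TN) for the rule f(x) > threshold => predict 1."""
    a = b = c = d = 0
    for f_x, y in zip(scores, labels):
        pred = 1 if f_x > threshold else 0
        if y == 1 and pred == 1:
            a += 1   # true positive (hit)
        elif y == 1 and pred == 0:
            b += 1   # false negative (miss)
        elif y == 0 and pred == 1:
            c += 1   # false positive (false alarm)
        else:
            d += 1   # true negative (correct rejection)
    return a, b, c, d

def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def precision_recall_f(a, b, c):
    recall = a / (a + b)        # fraction of the true positives that the model returns
    precision = a / (a + c)     # fraction of the returned items that are correct
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Slide 14 example: 1,000 relevant docs, 1,500 returned, 700 of the returned are correct,
# so a = 700, b = 1000 - 700 = 300 (relevant but missed), c = 1500 - 700 = 800 (returned but wrong).
p, r, f = precision_recall_f(a=700, b=300, c=800)
print(f"precision={p:.3f}  recall={r:.3f}  F={f:.3f}")
# precision = 700/1500 (about 0.467), recall = 700/1000 = 0.700, F about 0.560
```

The precision/recall curve and break-even point from slide 15 come from rerunning confusion_counts over a sweep of thresholds, exactly as the ROC sketch below does for TPR and FPR.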
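In the same spirit, a small sketch of the ROC construction from slides 10-12 (again an illustration under my own naming, not the instructor's code): lower the threshold one score at a time, collect (false positive rate, true positive rate) points, and approximate the AUC with the trapezoidal rule.

```python
# Illustrative ROC sketch: sweep the decision threshold over the scores,
# collect (FPR, TPR) points, and approximate the area under the curve.

def roc_points(scores, labels):
    """Return (fpr, tpr) pairs as the threshold sweeps from high to low."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1   # one more hit: sensitivity (TPR) rises
        else:
            fp += 1   # one more false alarm: 1 - specificity (FPR) rises
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal-rule area under the ROC curve (1.0 is perfect, about 0.5 is chance)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]   # made-up f(x) values
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(f"AUC = {auc(roc_points(scores, labels)):.3f}")
```

Plotting the point lists from two classifiers also makes the dominance property on slide 12 visible: if one curve lies above the other everywhere, that method dominates.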
Slide 18: Confidence Interval

Slide 19: Evaluating Inductive Hypotheses
- The accuracy of a hypothesis on its training data is obviously biased, since the hypothesis was constructed to fit that data
- Accuracy must be evaluated on an independent (usually disjoint) test set
- The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets

Slide 20: Variance in Test Accuracy/Error
- Let error_S(h) denote the percentage of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h
- Let error_D(h) denote the true error rate for the overall data distribution D
- When n is big, the central limit theorem ensures that the distribution of error_S(h) for different random samples is closely approximated by a normal (Gaussian) distribution
(figure: P(error_S(h)) plotted against error_S(h), centered at error_D(h))

Slide 21: Confidence Intervals
- When trying to measure the mean of a random variable, if there are a sufficient number of samples and the samples are i.i.d. (drawn independently from the identical distribution), then the random variable can be represented by a Gaussian distribution with the sample mean and variance

Slide 22: Confidence Intervals
- The true mean will fall in the interval sample mean ± z_N * σ with N% confidence, where σ is the standard deviation and z_N gives the width of the interval about the mean that contains N% of the total probability under the Gaussian
- z_N is read from a pre-calculated table
- Note that while the test sets are independent in n-way cross-validation, the training sets are not, since they overlap (it is still a decent approximation)

Slide 23: Confidence Intervals
- Calculate the error on a test set of size n: error_S(h)
- Compute a confidence interval on this estimate
- The standard error of this estimate is sqrt( error_S(h) * (1 - error_S(h)) / n )
- The confidence interval on the true error is error_S(h) ± z_N * sqrt( error_S(h) * (1 - error_S(h)) / n )
- For a 95% confidence interval, z_0.025 = 1.96

Slide 24: Example
- Your classifier's error rate on a test set with 1,000 samples is 15%
- 95% confidence interval: ? (a worked sketch appears at the end of this preview)

Slide 25: Significance Test

Slide 26: Statistical Significance
- When can we say that one learning algorithm is better than another for a particular task or type of task?
- Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
- For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior for that task?

Slide 27: Comparing Two Learned Hypotheses
- When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies
- Assume h1 is tested on test set S1 of size n1, and h2 is tested on test set S2 of size n2
(figure: sampling distributions of error_S1(h1) and error_S2(h2); here h1 is observed to be more accurate than h2)

Slide 28: Comparing Two Learned Hypotheses (continued)
- Repeats the setup above; the preview text is cut off at "Observe ..."
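Picking up the example on slide 24 (15% error measured on a 1,000-sample test set), a minimal sketch of the interval from slide 23; the function name and printed rounding are mine, the formula is the one on the slide.

```python
from math import sqrt

def error_confidence_interval(error_rate, n, z=1.96):
    """Slide 23: error_S(h) +/- z_N * sqrt(error_S(h) * (1 - error_S(h)) / n)."""
    se = sqrt(error_rate * (1 - error_rate) / n)   # standard error of the estimate
    return error_rate - z * se, error_rate + z * se

# Slide 24: 15% error on 1,000 test samples, 95% confidence (z = 1.96).
low, high = error_confidence_interval(0.15, 1000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")   # roughly 0.15 +/- 0.022, i.e. about [0.128, 0.172]
```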
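The preview stops before slides 27-28 say how to resolve the comparison, so the following is only an assumption consistent with the notation above, not necessarily what the remaining slides do: put a confidence interval on the difference between the two measured error rates and check whether it contains zero. All names and numbers below are hypothetical.

```python
from math import sqrt

def error_difference_ci(e1, n1, e2, n2, z=1.96):
    """Assumed approach (not confirmed by the preview): approximate confidence
    interval for the true difference error_D(h1) - error_D(h2) when h1 and h2
    are evaluated on independent test sets S1 (size n1) and S2 (size n2)."""
    d = e1 - e2
    se = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)   # std. error of the difference
    return d - z * se, d + z * se

# Hypothetical numbers echoing slide 26: algorithm 1 at 95% accuracy (5% error) and
# algorithm 2 at 93% accuracy (7% error), each measured on its own 500-example test set.
low, high = error_difference_ci(0.05, 500, 0.07, 500)
print(f"95% CI for the error difference: [{low:.3f}, {high:.3f}]")
# If the interval contains 0, the observed 2% gap is not significant at this confidence level.
```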

