CS 6375 Machine Learning: Evaluation
Instructor: Yang Liu
(Slides adapted from Rich Caruana, Ray Mooney, and Tom Dietterich)

Today
- Performance measures: accuracy, ROC, precision/recall
- Comparing different classifiers

Performance Measure: Classification Accuracy (Binary)
- Target: 0/1 (true/false)
- Prediction f(x): 0/1 or a real value
- Thresholding: predict 1 if f(x) > threshold, else 0
- Accuracy = (# right) / (# total)

Confusion Matrix

            Predicted 1    Predicted 0
True 1          a              b
True 0          c              d

- a and d are correct; b and c are incorrect
- Accuracy = (a + d) / (a + b + c + d)

Terminology

            Predicted 1                          Predicted 0
True 1      true positives (TP): hits            false negatives (FN): misses
True 0      false positives (FP): false alarms   true negatives (TN): correct rejections

Problems with Accuracy
- Assumes equal cost for both kinds of errors
- Is 99% accuracy good? Is 10% bad? Either can be excellent, good, mediocre, poor, or terrible; it depends on the problem
- Base rate (chance performance): the accuracy obtained by always predicting the predominant class; for most problems, obtaining this is easy

Percent Reduction in Error
- 80% accuracy = 20% error rate
- If learning increases accuracy from 80% to 90%, error is reduced from 20% to 10%: a relative reduction of 50%

Costs: Adding Error Weights

            Predicted 1    Predicted 0
True 1         w_a            w_b
True 0         w_c            w_d

- Error weights can also be taken into account when building classifiers; the goal is then to minimize the weighted error rate

Receiver Operating Characteristic (ROC)
- Developed in WWII to statistically model false positive and false negative detections of radar operators
- A standard measure in medicine and biology
- Used a lot in ML too

ROC Plot
- Sweep the threshold (predict 1 if f(x) > threshold, else 0) and plot the true positive rate vs. the false positive rate, i.e., sensitivity vs. (1 - specificity)
- Sensitivity = a / (a + b) (the same as recall, discussed later)
- 1 - specificity = 1 - d / (c + d) = c / (c + d)
- Calculate the area under the curve (AUC): it represents performance averaged over all possible operating points
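The threshold sweep is easy to make concrete. Below is a minimal sketch (illustrative code, not from the slides): it assumes real-valued scores f(x) with 0/1 labels, computes a (false positive rate, true positive rate) point for each candidate threshold, and integrates AUC with the trapezoid rule.

```python
# Minimal ROC/AUC sketch: sweep a threshold over real-valued
# scores f(x), collect (FPR, TPR) points, and integrate the
# area under the curve with the trapezoid rule.

def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per candidate threshold."""
    pos = sum(labels)              # number of true 1s
    neg = len(labels) - pos        # number of true 0s
    pts = []
    # One threshold below every score (predict all 1s) plus one
    # at each distinct score value (strict '>' predicts 0 there).
    for t in [min(scores) - 1.0] + sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        pts.append((fp / neg, tp / pos))
    return sorted(pts)             # staircase from (0, 0) to (1, 1)

def auc(pts):
    """Trapezoid-rule area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical scores and labels, purely for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   1,    0,   0,   0]
print("AUC =", auc(roc_points(scores, labels)))  # 1.0 = perfect, 0.5 = chance
```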
Properties of ROC
- AUC = 1.0 is perfect prediction; AUC = 0.5 means something is wrong (no better than chance)
- The slope is non-increasing
- Each point on the ROC curve represents a different tradeoff between false positives and false negatives
- If two ROC curves do not intersect, one method dominates the other
- If two curves intersect, one is better in some regions and the other is better for other cost ratios

Precision and Recall
- Used in information retrieval and other detection tasks
- Recall: how many of the true positives does the model return?
- Precision: how many of the returned documents are correct?
- F-measure = 2 * precision * recall / (precision + recall), giving equal weight to precision and recall

Example
- A document collection has 1 million docs; for a given query, there are 1000 relevant docs
- The search engine returns 1500 docs; among them, 700 are correct
- Recall = ? Precision = ?

Precision and Recall (from the Confusion Matrix)

            Predicted 1    Predicted 0
True 1          a              b
True 0          c              d

- Recall = a / (a + b)
- Precision = a / (a + c)
- Precision-recall curve: sweep thresholds
- Break-even point: the threshold at which precision = recall

Precision-Recall Curve
[Figure: precision-recall curve obtained by sweeping the threshold]
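A quick check of the retrieval example above, computed directly from the slide's numbers (1000 relevant docs, 1500 returned, 700 of them correct):

```python
# Worked answer to the retrieval example above.
relevant, returned, correct = 1000, 1500, 700

recall = correct / relevant            # 700/1000 = 0.70
precision = correct / returned         # 700/1500 ~= 0.467
f_measure = 2 * precision * recall / (precision + recall)

print(f"recall    = {recall:.3f}")     # 0.700
print(f"precision = {precision:.3f}")  # 0.467
print(f"F-measure = {f_measure:.3f}")  # 0.560
```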
Summary of Performance Measures
- Accuracy may not be sufficient or appropriate
- Many other metrics exist
- Curves let you look at a range of operating points
- The measure you optimize makes a difference
- The measure you report makes a difference
- Use a measure appropriate for your problem and your community
- Not all of these generalize easily to more than 2 classes

Confidence Intervals

Evaluating Inductive Hypotheses
- The accuracy of a hypothesis on its training data is obviously biased, since the hypothesis was constructed to fit this data
- Accuracy must be evaluated on an independent (usually disjoint) test set
- The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets

Variance in Test Accuracy
- Let errorS(h) denote the percentage of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h
- Let errorD(h) denote the true error rate for the overall data distribution D
- When n is big, the central limit theorem ensures that the distribution of errorS(h) for different random samples will be closely approximated by a normal (Gaussian) distribution
[Figure: distribution P(errorS(h)) centered at errorD(h)]

Confidence Intervals
- When trying to measure the mean of a random variable, if there are a sufficient number of samples, and the samples are i.i.d. (drawn independently from the identical distribution), then the random variable can be represented by a Gaussian distribution with the sample mean and variance

Confidence Intervals (cont.)
- The true mean will fall in the interval (sample mean) ± z_N * sigma with N% confidence, where sigma is the standard deviation and z_N gives the width of the interval about the mean that includes N% of the total probability under the Gaussian
- z_N is drawn from a pre-calculated table
- Note that while the test sets are independent in n-way CV, the training sets are not, since they overlap; it is still a decent approximation

Confidence Intervals (computing)
- Calculate the error on a test set of size n: errorS(h)
- Compute a confidence interval on this estimate
- The standard error of this estimate is sqrt(errorS(h) * (1 - errorS(h)) / n)
- The confidence interval on the true error is errorS(h) ± z_N * sqrt(errorS(h) * (1 - errorS(h)) / n)
- For a 95% confidence interval, z_0.025 = 1.96

Example
- Your classifier's error rate on a test set with 1000 samples is 15%. What is the 95% confidence interval?
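Plugging the slide's numbers into the interval formula (a quick check, not part of the original slides):

```python
import math

# 95% confidence interval for the example above:
# error rate 0.15 measured on a test set of n = 1000 samples.
error, n, z = 0.15, 1000, 1.96

se = math.sqrt(error * (1 - error) / n)   # standard error
lo, hi = error - z * se, error + z * se

print(f"standard error = {se:.4f}")       # 0.0113
print(f"95% CI = [{lo:.3f}, {hi:.3f}]")   # roughly [0.128, 0.172]
```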
Significance Tests

Statistical Significance
- When can we say that one learning algorithm is better than another for a particular task, or type of tasks?
- Is a particular hypothesis really better than another one because its accuracy is higher on a validation set?
- For example, if learning algorithm 1 gets 95% accuracy and learning algorithm 2 gets 93% on a task, can we say with some confidence that algorithm 1 is superior for that task?

Comparing Two Learned Hypotheses
- When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies
- Assume h1 is tested on test set S1 of size n1, and h2 is tested on test set S2 of size n2
- Depending on the samples drawn, we may observe h1 more accurate than h2, or h1 less accurate than h2, even when the underlying distributions are the same
[Figure: sampling distributions P(errorS(h)) of errorS1(h1) and errorS2(h2)]

Statistical Hypothesis Testing
- Determines the probability of the null hypothesis that the two samples were actually drawn from the same underlying distribution
- We answer a question such as: if the null hypothesis were true, how unlikely would it be to obtain the observed data?
- By scientific convention, we reject the null hypothesis and say the difference is statistically significant when that probability is small (typically below 0.05)
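To make the comparison concrete, here is a minimal sketch (an illustration under assumed inputs, not code from the slides): a two-sided z-test for the difference of two error rates measured on independent test sets, using the same Gaussian approximation as the confidence intervals above. The numbers reuse the slide's 95% vs. 93% accuracy example, with hypothetical test sets of 1000 examples each.

```python
import math

def z_test_two_errors(e1, n1, e2, n2):
    """Two-sided z-test for the difference of two error rates
    measured on independent test sets (Gaussian approximation)."""
    d = e1 - e2
    # The variance of the difference is the sum of the two variances.
    se = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = d / se
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Error rates 5% and 7% (i.e., 95% vs. 93% accuracy), n1 = n2 = 1000.
z, p = z_test_two_errors(0.05, 1000, 0.07, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # reject the null hypothesis if p < 0.05
```

With these particular numbers the p-value comes out just above 0.05, so a 2-point accuracy gap on 1000-example test sets is not quite significant at the conventional level.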