Experimentally Evaluating Classifiers
William Cohen

PRACTICAL LESSONS

This loss function overfits.

Overfitting is often a problem in supervised learning. When you fit the data (minimize loss), are you fitting "real structure" in the data or "noise" in the data? Will the patterns you see appear in a test set or not?

[Figure: error/loss on the training set D and error/loss on an unseen test set Dtest, plotted as the number of features grows; labels "hi error" and "more features".]

Discussion of Kaggle: http://techtalks.tv/talks/machine-learning-competitions/58340/

CONFIDENCE INTERVALS ON THE ERROR RATE ESTIMATED BY A SAMPLE
PART 1: THE MAIN IDEA

A practical problem

You've just trained a classifier h using YFCL on YFP. You tested h on a sample S and the error rate was 0.30.
How good is that estimate?
Should you throw away the old classifier, which had an error rate of 0.35, and replace it with h?
Can you write a paper saying you've reduced the best known error rate for YFP from 0.35 to 0.30? Would it be accepted?

YFCL = Your Favorite Classifier Learner. YFP = Your Favorite Problem.

Two definitions of error

The true error of h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random from D:

$error_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$

The sample error of h with respect to target function f and sample S is the fraction of instances in S that h misclassifies:

$error_S(h) \equiv \frac{1}{|S|} \sum_{x \in S} \delta(f(x) \neq h(x))$

where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$, else 0.

Usually $error_D(h)$ is unknown and we use an estimate, $error_S(h)$. How good is this estimate?

Why sample error is wrong

Bias: if S is the training set, then $error_S(h)$ is optimistically biased, i.e. $error_S(h) < error_D(h)$.

$Bias \equiv E[error_S(h)] - error_D(h)$

This is true if S was used at any stage of learning: feature engineering, parameter testing, feature selection, ...
You want S and h to be independent.
A popular split is train, development, and evaluation.

Bias: if S is independent from the training set and drawn from D, then the estimate is unbiased:

$Bias \equiv E[error_S(h)] - error_D(h) = 0$

Variance: but even if S is independent, $error_S(h)$ may still vary from $error_D(h)$:

$Var \equiv E[(error_S(h) - E[error_S(h)])^2]$

A simple example

Hypothesis h misclassifies 12 of 40 examples from S, so $error_S(h) = 12/40 = 0.30$.
What is $error_D(h)$? Is it $error_S(h)$? Is it less than 0.35?

The event "h makes an error on x" is a random variable (over examples X from D). In fact, it's a binomial with parameter $\theta$. With r errors in n trials, the MLE of $\theta$ is $r/n = 0.30$.

In fact, for a binomial we know the whole pmf (probability mass function).

To pick a confidence interval we need to clarify what's random and what's not.

Aside: credibility intervals

What we have is $\Pr(R = r \mid \Theta = \theta)$.
Arguably what we want is $\Pr(\Theta = \theta \mid R = r) = \frac{1}{Z} \Pr(R = r \mid \Theta = \theta) \Pr(\Theta = \theta)$, which would give us a MAP for $\theta$, or an interval that probably contains $\theta$.
This isn't common practice.

Commonly:
h and $error_D(h)$ are fixed but unknown.
S is a random variable; sampling is the experiment.
$R = error_S(h)$ is a random variable depending on S.
We ask: what other outcomes of the experiment are likely?

Is $\theta < 0.35$ (= 14/40)? Given this estimate of $\theta$, the probability of a sample S that would make me think that …

I can pick a range of $\theta$ such that the probability of a sample that would lead to an estimate outside the range is low. Given my estimate of $\theta$, the probability of a sample with fewer than 6 errors or more than …

If that's true, then [6/40, 16/40] is a 95% confidence interval for $\theta$.

Confidence intervals

You now know how to compute a confidence interval.
You'd want a computer to do it, because computing the binomial exactly is a chore.
If you have enough data, then there are some simpler approximations.

CONFIDENCE INTERVALS ON THE ERROR RATE ESTIMATED BY A SAMPLE
PART 2: COMMON APPROXIMATIONS

Recipe 1 for confidence intervals

If $|S| = n$ and $n \geq 30$,
all the samples in S are drawn independently of h and each other,
and $error_S(h) = p$,
then with 95% probability, $error_D(h)$ is in

$p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}$

Another rule of thumb: it's safe to use this approximation when the interval is within [0, 1].

Recipe 2 for confidence intervals

If $|S| = n$ and $n \geq 30$,
all the samples in S are drawn independently of h and each other,
and $error_S(h) = p$,
then with N% probability, $error_D(h)$ is in

$p \pm z_N \sqrt{\frac{p(1-p)}{n}}$

for these values of N:

[Table: values of $z_N$ for each confidence level N%.]

Why do these recipes work?

Binomial distribution for R = number of heads in n flips, with p = Pr(heads):
Expected value of R: $E[R] = np$
Variance of R: $Var[R] = E[(R - E[R])^2] = np(1-p)$
Standard deviation of R: $\sigma_R = \sqrt{np(1-p)}$
Standard error: $SE = \sigma_R / n = \sqrt{\frac{p(1-p)}{n}}$

SE = expected distance between a sample mean for a size-n sample and E[X].
SD = expected distance between a single sample of X and E[X].

So the interval in Recipe 1 is

$p \pm 1.96 \sqrt{\frac{p(1-p)}{n}} = p \pm 1.96 \cdot SE$

So:
$E[error_S(h)] = error_D(h)$
The standard deviation of $error_S(h)$ is the standard error of one draw from a binomial with parameter p, or

$\sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{error_S(h)(1 - error_S(h))}{|S|}}$

For large n the binomial mean approximates a normal distribution with the same mean and sd.
The rule of thumb is to consider "large" n to be $n \geq 30$.

Why recipe 2 works

By CLT we …
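The two routes to an interval described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the function names (`binom_pmf`, `prob_outside`, `normal_interval`) are hypothetical, and the exact check assumes the unknown parameter equals the estimate 12/40 from the running example. The first route sums the binomial pmf to find how likely a sample with fewer than 6 or more than 16 errors would be; the second applies the normal approximation p ± z_N · sqrt(p(1-p)/n), taking z_N from the standard normal distribution.

```python
from __future__ import annotations

import math
from statistics import NormalDist


def binom_pmf(r: int, n: int, theta: float) -> float:
    """Pr(R = r) when R ~ Binomial(n, theta)."""
    return math.comb(n, r) * theta**r * (1.0 - theta) ** (n - r)


def prob_outside(lo: int, hi: int, n: int, theta: float) -> float:
    """Pr(R < lo or R > hi): chance that a size-n sample gives an
    error count outside the range [lo, hi]."""
    inside = sum(binom_pmf(r, n, theta) for r in range(lo, hi + 1))
    return 1.0 - inside


def normal_interval(p: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation interval p +/- z_N * sqrt(p(1-p)/n).

    z_N is the two-sided critical value for the given confidence level
    (about 1.96 when level = 0.95).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


if __name__ == "__main__":
    n, r = 40, 12            # running example: 12 errors in 40 test examples
    theta_hat = r / n        # MLE of the error rate, 0.30

    # Exact route: probability of a sample landing outside [6, 16] errors,
    # assuming the true error rate equals theta_hat.
    print("Pr(outside [6, 16]):", prob_outside(6, 16, n, theta_hat))

    # Approximate route: 95% interval from the normal approximation.
    print("normal 95% interval:", normal_interval(theta_hat, n))
```

For the running example the normal-approximation interval is roughly 0.30 ± 0.14, which stays inside [0, 1] as the rule of thumb requires. The exact tail probability printed by `prob_outside` can be compared directly against the 0.05 that a 95% interval calls for.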