Slide 1: Running Experiments for Your Term Projects
Dana S. Nau
CMSC 722, AI Planning
University of Maryland
Lecture slides for Automated Planning: Theory and Practice

Slide 2: Measuring a Program's Performance
- Some of you want to run experiments to measure the performance of a program
- Possible things to measure:
  - Running time
    » usually CPU time, not clock time
  - Solution quality
    » size? cost? some other measure?
  - Number of problems solved
  - Other things
    » e.g., in a game you might measure the number of wins, or the number of points
- How to display the results:
  - Usually as a graph
  - It can be hard to look at a large table of numbers and figure out what it means

Slide 3: Sources of Test Problems
- International Planning Competition: http://ipc.icaps-conference.org/
  - For most of the competitions, you can download the entire set of competition problems
- Usually they're written in a language called PDDL: http://cs-www.cs.yale.edu/homes/dvm/
- Some PDDL tutorials:
  - http://www.ida.liu.se/~TDDA13/labbar/planning/2003/writing.html
  - http://www.cs.toronto.edu/~sheila/2542/w09/A1/introtopddl2.pdf
  - http://www.cs.cmu.edu/~mmv/planning/homework/PDDL_Examples.pdf

Slide 4: The IPC test data isn't always adequate
- Only 20 problems total!
- Not clear whether the results are statistically meaningful
- The problems aren't sorted by difficulty, so it's hard to see what the trends are

Slide 5: (continued)
- Preferably, each data point should be an average of at least 20 problems of similar difficulty
- What does "similar difficulty" mean?
  - Generally it means similar size
  - Sometimes there's more than one possible measure of size (see below)
- Below, each data point is an average over 100 problems
  - The curve is much smoother -- easier to see how the program is performing
  - These results have high statistical
significance.

Slide 6: Statistical significance?
- If each data point is an average of many runs, it's possible to compute error bars
  - e.g., a 95% confidence interval around each point
  - Some graphing programs can compute this for you automatically
- In other areas of AI (e.g., machine learning), people do this a lot
- In AI planning, almost nobody does it, and I won't require it

Slide 7: How to get a large set of problems
- In some cases, you can download the program that generated the problems, and use it to generate more problems
  - If you can't find the program on the web, ask me and I'll check whether I can find it
- In other cases you can build such a program yourself -- but be careful how you do it
  - It's important for the program to generate an unbiased random sample
  - On a biased sample, a program's performance can look much different than on an unbiased sample
  - Many years ago, when one of my PhD students showed me his experimental results, his program appeared to be running in time O(n^2). When I looked at how he was generating his problems, it turned out his generator only produced very easy ones. When we fixed that, the running time turned out to be exponential.

Slides 8-10: Polynomial versus exponential?
- Which of these is growing the fastest?
  - y = x^2
  - y = x^3.5
  - y = 2^x
  - y = 1.2^x

Slide 11: What if a program can't handle all the data?
- Sometimes a program can't solve all of the problems at each data point
  - It may run out of time or run out of memory
- In this case, you generally need to throw out that data point
- Don't just take an average over the problems that the program can solve
  - This biases the data -- you're only reporting the average on the easy problems
  - That makes a program's performance look much better than it really is

Slide 12: (example) A 2-hour time limit for each run
- If all 100 runs took less than 2 hours each, we took the average time per run
- If lots of runs failed to finish within 2 hours, we omitted those data points
  - No good way to get
a good value for those data points
- If at least 97 of the runs finished within 2 hours:
  - We counted the other failures as 2 hours each when computing the average
  - We marked the data point with an asterisk, to tell the reader that in this case the program looks better than it actually is
- This gave us data points that were close to being correct
  - If we had thrown them away, the graph wouldn't have been as informative

Slide 13: What problem domains to use?
- Depends on what you're trying to show
- For a journal or conference paper:
  - If you want to show that a program works well in lots of cases, then you generally want to compare its performance against other programs on several (at least 3?) domains that are significantly different from each other
- For the term project, results on a single domain are probably OK
  - Select a domain that illustrates the particular situation you're trying to investigate

Slide 14: If you have difficulty
- If you're trying to evaluate a program's performance and you run into difficulty
  - e.g., there's some reason why it wouldn't be feasible to run your program on a large set of problems
- Come discuss it with me
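The advice on Slides 5 and 6 -- average each data point over many problems of similar size, and report a 95% confidence interval as error bars -- can be sketched in Python. This is a minimal illustration, not code from the slides: the function names (`summarize`, `curve`) are mine, and it uses the normal approximation 1.96 * SEM for the interval, which is reasonable for the 20-100 runs per point the slides recommend.

```python
import math
import statistics

def summarize(runtimes):
    """Mean runtime and a ~95% confidence-interval half-width for one
    data point, via the normal approximation (fine for >= ~30 runs)."""
    n = len(runtimes)
    mean = statistics.mean(runtimes)
    sem = statistics.stdev(runtimes) / math.sqrt(n)  # standard error of the mean
    return mean, 1.96 * sem  # point estimate, error-bar half-width

def curve(results_by_size):
    """results_by_size maps problem size -> list of runtimes.
    Returns (size, mean, halfwidth) tuples sorted by size, one per data
    point, ready to plot as a curve with error bars mean +/- halfwidth."""
    return [(size, *summarize(ts)) for size, ts in sorted(results_by_size.items())]
```

Most plotting tools accept the half-widths directly as symmetric error bars (e.g., matplotlib's `errorbar` with `yerr`), which is the "some graphing programs can compute this for you" step from Slide 6.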
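Slide 12's policy for runs that hit the time limit can be written as a small helper. This is my reading of the slides' 100-run setup: the function name is mine, and `max_failures=3` encodes "at least 97 of the runs finished" -- adjust it if your points average a different number of runs.

```python
TIME_LIMIT = 2 * 3600  # the slides' 2-hour cap, in seconds

def average_with_timeouts(times, limit=TIME_LIMIT, max_failures=3):
    """times holds one entry per run; None means the run hit the limit.
    Implements the policy from Slide 12:
      - every run finished      -> plain average
      - a few runs timed out    -> count each timeout as `limit`, flag the
                                   point (plot it with an asterisk, since
                                   it understates the true average)
      - too many runs timed out -> omit the data point entirely
    Returns (average, flagged) or None when the point should be dropped."""
    failures = sum(1 for t in times if t is None)
    if failures == 0:
        return sum(times) / len(times), False
    if failures <= max_failures:
        capped = [limit if t is None else t for t in times]
        return sum(capped) / len(capped), True
    return None  # no good way to fill in a value for this point
```

Note the direction of the bias: counting a timeout as exactly `limit` can only make the program look *better* than it is, which is why the slides mark these points for the reader rather than reporting them silently.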
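For the question on Slides 8-10 (which of y = x^2, x^3.5, 2^x, 1.2^x grows fastest?), a quick numeric check makes the point concrete: over a small range the polynomials can look fastest, but any exponential with base > 1 eventually dominates every polynomial, because log(b^x) grows linearly in x while log(x^k) grows only logarithmically. The `fastest_at` helper below is illustrative, not from the slides.

```python
curves = {
    "x^2":   lambda x: x ** 2,
    "x^3.5": lambda x: x ** 3.5,
    "2^x":   lambda x: 2.0 ** x,
    "1.2^x": lambda x: 1.2 ** x,
}

def fastest_at(x):
    """Name of the curve with the largest value at x."""
    return max(curves, key=lambda name: curves[name](x))

# At x = 10, x^3.5 is ahead; by x = 100, 2^x has left the polynomials
# far behind, and even the slow-looking 1.2^x overtakes x^3.5.
```

This is also why experiments over too small a problem-size range can be misleading about asymptotic behavior (the O(n^2)-that-was-really-exponential anecdote on Slide 7): plot on a log scale, where exponentials are straight lines, before drawing conclusions.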