Experiment Design for Computer Scientists
Marie desJardins ([email protected])
CMSC 691B
March 9, 2004

Contents:
- Sources
- Experiment design
- Provable Claims
- More Provable Claims
- One More
- Measurable, Meaningful Criteria
- Measurable Criteria
- Meaningful Criteria
- Example 1: CISC
- Example 2: MYCIN
- MYCIN Study 2
- MYCIN Study 3
- MYCIN Results
- MYCIN Lessons Learned
- Reasonable Baselines
- Baseline: Point of Comparison
- Poor Baselines
- Establish a Need
- Test Alternative Explanations
- Is CHC Better than Random HC?
- Statistically Valid Results
- Look at Your Data
- Anscombe Datasets Plotted
- Look at Your Data, Again
- Closer analysis reveals…
- Statistical Methods

Sources

- Paul Cohen, Empirical Methods in Artificial Intelligence, MIT Press, 1995.
- Tom Dietterich, CS 591 class slides, Oregon State University.
- Rob Holte, "Experimental Methodology," presented at the ICML 2003 Minitutorial on Research, 'Riting, and Reviews.

Experiment design

Experiment design criteria:
- Claims should be provable
- Contributing factors should be isolated and controlled for
- Evaluation criteria should be measurable and meaningful
- Data should be gathered on a convincing domain/problem
- Baselines should be reasonable
- Results should be shown to be statistically valid

Provable Claims

Many research goals start out vague:
- Build a better planner
- Learn preference functions
Eventually, these claims need to be made provable:
- Concrete
- Quantitative
- Measurable
Provable claims:
- My planner can solve large, real-world planning problems under conditions of uncertainty, in polynomial time, with few execution-time repairs.
- My learning system can learn to rank objects, producing rankings that are consistent with user preferences, as measured by the probability of retrieving desired objects.

More Provable Claims

More vague claims:
- Render painterly drawings
- Design a better interface
Provable claims:
- My system can convert input images into drawings in the style of Matisse, with high user approval and with measurably similar characteristics to actual Matisse drawings (color, texture, and contrast distributions).
- My interface can be learned by novice users in less time than it takes to learn Matlab; task performance has equal quality but takes significantly less time than using Matlab.

One More

Vague claim:
- Visualize relational data
Provable claim:
- My system can load and draw layouts for relational datasets of up to 2M items in less than 5 seconds; the resulting drawings exhibit efficient screen utilization and few edge crossings; and users are able to manually infer important relationships in less time than when viewing the same datasets with MicroViz.

Measurable, Meaningful Criteria

Measurable Criteria

Ideally, your evaluation criteria should be:
- Easy to measure
- Reliable (i.e., replicable)
- Valid (i.e., measuring the right thing)
- Applicable early in the design process
- Convincing
Typical criteria:
- CPU time / clock time
- Cycles per instruction
- Number of [iterations, search states, disk seeks, ...]
- Percentage of correct classifications
- Number of [interface flaws, user interventions, necessary modifications, ...]
(Adapted with permission from Tom Dietterich's CS 519 (Oregon State University) course slides.)
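Two of these criteria, CPU time and percentage of correct classifications, are easy to instrument directly. Below is a minimal Python sketch, assuming a hypothetical classifier object with a predict method and a labeled test set; neither is from the slides.

    import time

    def evaluate(classifier, test_set):
        # test_set is a list of (example, label) pairs; classifier.predict is
        # a hypothetical stand-in for whatever system is being measured.
        start = time.process_time()  # CPU time, as opposed to wall-clock time
        predictions = [classifier.predict(x) for x, _ in test_set]
        cpu_seconds = time.process_time() - start
        correct = sum(p == y for p, (_, y) in zip(predictions, test_set))
        accuracy = correct / len(test_set)  # fraction classified correctly
        return cpu_seconds, accuracy

A single timing is rarely replicable; repeating the measurement and reporting the mean and variance addresses the "reliable" criterion above.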
Meaningful Criteria

Evaluation criteria must address the claim you are trying to make; there must be a clear relationship between the claim/goals and the evaluation criteria.
Good criteria:
- Your system scores well iff it meets your stated goal
Bad criteria:
- Your system can score well even though it doesn't meet the stated goal
- Your system can score badly even though it does meet the stated goal

Example 1: CISC

True goals:
- Efficiency (low instruction fetch, few page faults)
- Cost-effectiveness (low memory cost)
- Ease of programming
Early metrics:
- Code size (in bytes)
- Entropy of the op-code field
- Orthogonality (can all modes be combined?)
Efficient execution of the resulting programs was not being directly considered. RISC showed that the connection between these criteria and the true goals was no longer strong → the metrics were not appropriate!
(Adapted with permission from Tom Dietterich's CS 519 (Oregon State University) course slides.)

Example 2: MYCIN

MYCIN: an expert system for diagnosing bacterial infections in the blood.
Study 1 evaluation criteria were expert ratings of program traces:
- Did the patient need treatment?
- Were the isolated organisms significant?
- Was the system able to select an appropriate therapy?
- What was the overall quality of MYCIN's diagnosis?
Problems:
- Overly subjective data
- Assumed that experts were ideal diagnosticians
- Experts may have been biased against the computer
- Required too much expert time
- Limited set of experts (all from Stanford Hospital)
(Adapted with permission from Tom Dietterich's CS 519 (Oregon State University) course slides.)

MYCIN Study 2

Evaluation criteria:
- Expert ratings of the treatment plan
- Multiple-choice rating system for MYCIN's recommendations
- Experts from several different hospitals
Comparison to Study 1:
+ Objective ratings
+ More diverse experts
− Still assumes that the experts are right
− Still has possible anti-computer bias
− Still takes a lot of time
(Adapted with permission from Tom Dietterich's CS 519 (Oregon State University) course slides.)

MYCIN Study 3

Evaluation criteria: multiple-choice ratings in a blind evaluation setting, covering:
- MYCIN recommendations
- Novice recommendations
- Intermediate recommendations
- Expert recommendations
Comparison to Study 2:
+ No more anti-computer bias
− Still assumes expert ratings are correct
− Still time-consuming (maybe even more so!)
(Adapted with permission from Tom Dietterich's CS 519 (Oregon State University) course slides.)
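The key design move in Study 3 is the blinding itself: raters see pooled, shuffled recommendations with the source labels stripped, and labels are re-attached only at analysis time. A minimal sketch of that setup, with hypothetical recommendation lists standing in for the real case data:

    import random

    # Hypothetical recommendations keyed by source; in Study 3 the sources were
    # MYCIN plus human prescribers at novice, intermediate, and expert levels.
    recommendations = {
        "mycin": ["rec_m1", "rec_m2"],
        "novice": ["rec_n1", "rec_n2"],
        "intermediate": ["rec_i1", "rec_i2"],
        "expert": ["rec_e1", "rec_e2"],
    }

    # Pool everything, then shuffle so raters cannot tell which recommendations
    # came from the computer (removing the anti-computer bias of Studies 1-2).
    pooled = [(src, rec) for src, recs in recommendations.items() for rec in recs]
    random.shuffle(pooled)

    blinded = [rec for _, rec in pooled]     # what the raters actually see
    key = {rec: src for src, rec in pooled}  # hidden until analysis time

Because ratings are mapped back to their sources only after they are collected, the anti-computer bias disappears; the other weaknesses (expert ratings treated as ground truth, rater time) remain.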