CS590Z Statistical Debugging
Xiangyu Zhang (part of the slides are from Chao Liu)

A Very Important Principle
- Traditional debugging techniques deal with a single (or very few) executions.
- With a large set of executions, including passing and failing ones, statistical debugging is often highly effective. Such sets come from:
  - Failure reporting
  - In-house testing

Tarantula (ASE 2005, ISSTA 2007)

Scalable Remote Bug Isolation (PLDI 2004, 2005)
- Look at predicates:
  - Branches
  - Function returns (<0, <=0, >0, >=0, ==0, !=0)
  - Scalar pairs: for each assignment x = ..., find all variables y_i and constants c_j, and form each pair x (=, <, <=, ...) y_i or c_j
- Sample the predicate evaluations (Bernoulli sampling).
- Investigate how the probability of a predicate being true relates to bug manifestation.

Bug Isolation
- How much does P being true increase the probability of failure over simply reaching the line where P is sampled?

An Example
- Symptoms:
  - 563 lines of C code
  - 130 out of 5542 test cases fail to give correct outputs
  - No crashes
  - Predicates are evaluated to both true and false within a single execution

Faulty version (the check of lastm is omitted):

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while (lin[i] != ENDSTR) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {              /* fault: missing (lastm != m) */
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}

Correct version:

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while (lin[i] != ENDSTR) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0) && (lastm != m)) {
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}

Not enough:
P_f(A) = P̃(A | A & !B)
P_t(A) = P̃(A | !(A & !B))

Program Predicates
- A predicate is a proposition about any program property, e.g., idx < BUFSIZE, a + b == c, foo() > 0, ...
- Each predicate can be evaluated multiple times during one execution.
- Every evaluation gives either true or false.
- Therefore, a predicate is simply a Boolean random variable that encodes program executions from a particular aspect.

Evaluation Bias of Predicate P
- Evaluation bias
  - Definition: the probability of P being evaluated as true within one execution
  - Maximum-likelihood estimate: the number of true evaluations over the total number of evaluations in one run
  - Each run gives one observation of the evaluation bias of predicate P
- Suppose we have n correct and m incorrect executions. For any predicate P, we end up with:
  - An observation sequence for correct runs: S_p = (X'_1, X'_2, ..., X'_n)
  - An observation sequence for incorrect runs: S_f = (X_1, X_2, ..., X_m)
- Can we infer whether P is suspicious based on S_p and S_f?

Underlying Populations
- Imagine the underlying distributions of evaluation bias for correct and for incorrect executions; S_p and S_f can be viewed as random samples from these underlying populations, respectively.
- One major heuristic: the larger the divergence between the two distributions, the more relevant predicate P is to the bug.
[Figure: the two evaluation-bias distributions, each a density over [0, 1]]
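To make the evaluation-bias machinery concrete, here is a minimal sketch in C of how the per-run observations in S_p and S_f could be computed from predicate counters, and how a simple divergence between the two samples might serve as a stand-in bug-relevance score s(P). This is not the actual SOBER statistic; the counter layout (num_true, num_total), the standardized mean-difference score, and the sample values in main are assumptions made purely for illustration.

#include <math.h>
#include <stdio.h>

/* Per-run counters for one predicate P: how many evaluations were true
 * and how many evaluations occurred in total within that execution.    */
typedef struct {
    unsigned num_true;
    unsigned num_total;
} pred_counts;

/* Evaluation bias of P in one run: maximum-likelihood estimate,
 * i.e. (#true evaluations) / (#total evaluations).                     */
static double eval_bias(pred_counts c)
{
    return c.num_total ? (double)c.num_true / (double)c.num_total : 0.0;
}

static double mean(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return n > 0 ? s / n : 0.0;
}

static double sample_var(const double *x, int n, double mu)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += (x[i] - mu) * (x[i] - mu);
    return n > 1 ? s / (n - 1) : 0.0;
}

/* A simple stand-in for the bug-relevance score s(P): the standardized
 * difference between the mean evaluation bias in failing runs (S_f) and
 * in passing runs (S_p).  A larger divergence ranks P as more suspect.  */
double relevance_score(const double *S_p, int n, const double *S_f, int m)
{
    double mu_p = mean(S_p, n), mu_f = mean(S_f, m);
    double v_p = sample_var(S_p, n, mu_p), v_f = sample_var(S_f, m, mu_f);
    double spread = sqrt(v_p / (n > 0 ? n : 1) + v_f / (m > 0 ? m : 1)) + 1e-9;
    return fabs(mu_f - mu_p) / spread;
}

int main(void)
{
    /* Hypothetical observations: evaluation bias of P in 4 passing and
     * 3 failing runs (values invented for this example).                */
    pred_counts fail_run = { 9, 10 };
    double S_p[] = { 0.10, 0.05, 0.00, 0.15 };
    double S_f[] = { 0.80, eval_bias(fail_run), 0.95 };
    printf("s(P) = %.3f\n", relevance_score(S_p, 4, S_f, 3));
    return 0;
}

Predicates would then be sorted in decreasing order of this score to produce the ranked list described below under Algorithm Outputs.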
Major Challenges
- We have no knowledge of the closed forms of either distribution.
- Usually, we do not have enough incorrect executions to estimate the failing-run distribution reliably.
[Figure: the two evaluation-bias distributions, each a density over [0, 1]]

Our Approach

Algorithm Outputs
- A ranked list of program predicates w.r.t. the bug-relevance score s(P)
- Higher-ranked predicates are regarded as more relevant to the bug
- What's the use?
  - Top-ranked predicates suggest possible buggy regions
  - Several predicates may point to the same region
  - ...

Outline
- Program Predicates
- Predicate Rankings
- Experimental Results
- Case Study: bc-1.06
- Future Work
- Conclusions

Experiment Results
- Localization quality metric
  - Software bug benchmark
  - Quantitative metric
- Related works
  - Cause Transition (CT) [CZ05]
  - Statistical Debugging [LN+05]
- Performance comparisons

Bug Benchmark
- Dream benchmark: a large number of known bugs in large-scale programs with an adequate test suite
- Siemens Program Suite
  - 130 variants of 7 subject programs, each 100-600 LOC
  - 130 known bugs in total, mainly logic (or semantic) bugs
- Advantages
  - Known bugs, so judgments are objective
  - A large number of bugs, so a comparative study is statistically significant
- Disadvantages
  - Small-scale subject programs
- State-of-the-art performance claimed in the literature so far: the cause-transition approach [CZ05]

Localization Quality Metric [RR03]

1st Example
[Figure: example with numbered nodes] T-score = 70%

2nd Example
[Figure: example with numbered nodes] T-score = 20%

Related Works
- Cause Transition (CT) approach [CZ05]
  - A variant of delta debugging [Z02]
  - Previous state-of-the-art performance holder on the Siemens suite
  - Published in ICSE '05, May 15, 2005
  - Cons: it relies on memory abnormality, hence its performance is restricted
- Statistical Debugging (Liblit05) [LN+05]
  - Predicate ranking based on discriminant analysis
  - Published in PLDI '05, June 12, 2005
  - Cons: it ignores the evaluation patterns of predicates within each execution

Localized bugs w.r.t. Examined Code

Cumulative Effects w.r.t. Code Examination

Top-k Selection
- Regardless of the specific choice of k, both Liblit05 and SOBER are better than CT, the current state-of-the-art holder.
- From k = 2 to 10, SOBER is consistently better than Liblit05.

Outline
- Evaluation Bias of Predicates
- Predicate Rankings
- Experimental Results
- Case Study: bc-1.06
- Future Work
- Conclusions

Case Study: bc 1.06
- bc 1.06
  - 14288 LOC
  - An arbitrary-precision calculator
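Relating back to the Localization Quality Metric [RR03] and the two T-score examples above, the sketch below computes a deliberately simplified, rank-based approximation of the T-score: the percentage of statements a developer examines, following the ranked report top-down, before reaching the first faulty statement. The actual metric in [RR03] is defined via a search over the program dependence graph rather than a flat ranking, so this is only an illustration; the statement numbering, the ranked report, and the helper names (t_score, is_faulty, ranked) are invented for the example.

#include <stdio.h>

/* Simplified T-score: walk a ranked statement report from the top and
 * return the percentage of the program examined when the first faulty
 * statement is reached (100% if the report never reaches the bug).     */
double t_score(const int *ranked, int n_ranked,
               const int *is_faulty, int n_stmts)
{
    for (int i = 0; i < n_ranked; i++) {
        int s = ranked[i];
        if (s >= 0 && s < n_stmts && is_faulty[s])
            return 100.0 * (double)(i + 1) / (double)n_stmts;
    }
    return 100.0;
}

int main(void)
{
    /* Hypothetical 10-statement program whose statement 7 is faulty,
     * and a hypothetical report that ranks statement 7 second.          */
    int is_faulty[10] = { 0, 0, 0, 0, 0, 0, 0, 1, 0, 0 };
    int ranked[4] = { 3, 7, 1, 9 };
    printf("T-score = %.0f%%\n", t_score(ranked, 4, is_faulty, 10));
    return 0;
}

A lower T-score means less code has to be examined before the bug is found, which is how the 70% and 20% figures in the two examples above are read.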