Slide 1Problem StatementMotivationsApproachesStatic ApproachesMOSSSlide 7Slide 8Slide 9Slide 10Slide 11Slide 12AST based matchingDECKARD (ICSE 2007)DECKARDCFG matchingCFG matchingSemantic Based MatchedSemantic BasedSlide 20Wrap UpCS590 Z Matching Program VersionsXiangyu ZhangCS590ZProblem StatementSuppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide if c’ belongs to the difference, if not, find the correspondence of c’ in P.•Static mapping•Non-trivialName comparison?What if •Clone analysis, comparison checkingCS590ZMotivationsValidate compiler transformationsFacilitate regression testingReverse obfuscationInformation propagationDebuggingCode plagiarism detectionInformation AssuranceCS590ZApproachesStatic Approaches•Entity name based•String based (MOSS)•AST based (DECKARD)•CFG based (JDIFF)•PDG based (PDIFF)•Binary based (BMAT)•Log based (editor plugin, comparison checking)Dynamic Approaches (not today)CS590ZStatic ApproachesEntity name matching•Model a function/field as tuples•Coarse grained matchingString matching•Diff (CVS, Subservion)•Longest common subsequence (LCS)Available operations are addition and deletionMatched pairs can not cross one anotherPrograms are far more complicated than stringsCopy, paste, move•CP-Miner (scale to linux kernel clone detection)Frequent subsequence miningCS590ZMOSSCode plagiarism detection•It also handles other digital contents Challenges•White space (variable name)•Noise (“the”, “int i”);•Order scrambling (paragraph reorders)Problem statement•Given a set of documents, identify substring matches that satisfy two properties:If there is a substring match at least as long as the guarantee threshold t, then this match is detected;Do not detect any matches shorter than the noise threshold, k.CS590ZMOSSk-gram•A continuous substring of length kCS590ZMOSSIncremental hashing•Hashing strings of length k is expensive for large k.•“rolling” hash function The (i+1)th k-gram hash = F (the i th k-gram hash, …)CS590ZMOSSFingerprint selection•A subset of hash values•Our goals: find all matching substrings >t; ignore matchings <k)•One of every tth hash values•0 mod pCS590ZMOSSWinnowing•Observation: given a sequence of hashes h1,…hn, if n>t-k, then at least one of the hi must be chosen•Have a sliding window with size w=t-k+1•In each window select the minimum hash value, break ties by select the rightmost occurrence.CS590ZMOSSAlgorithm•Build an index mapping fingerprints to locations for all documents.•Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.•Sort (d,d1,fx), (d, d2,fy) by the first two elements. •Matches between documents are rank-ordered by size (number of fingerprints)CS590ZMOSSAdvantages•Guarantee to detect any >t substring matchesLimitations•Minor edits fail MOSS.x= a*b + c vs. z= c + a*b•Insertion, deletionCS590ZAST based matching[YANG, 1991, Software Practice and Experience]•Given two functions, build the ASTs•Match the roots•If so, apply LCS to align subtrees•Continue recursivelyFragileCS590ZDECKARD (ICSE 2007)CS590ZDECKARDAdvantages•Scalability•Insensitive to minor structural changes such as reordering, insertion, deletionLimitations•Structural similarity only•Insertion that incurs structure change.CS590ZCFG matching Hammock graph (JDIFF ,ASE 2004)•Match classes by names•Match fields by types•Match methods by signatures•Match instruction in methods by hammock graphsA hammock is a single entry single exit subgraph of a CFG.CS590ZCFG matchingPros•OrthogonalCan be combined with other matching techniques•SimpleCons•Coarse grained matching onlyNot good at clone detection•In case of code transformationCS590ZSemantic Based MatchedUsing PDG (SAS’01)CS590ZSemantic BasedCS590ZSemantic BasedPros•Non-contiguous, intertwined, reordered•Insensitive to code transformations.Cons•ScalabilityPoints-to analysis•Starting from a matching pair seems to be a problemCS590ZWrap UpFor clone detection•Maybe structural / text similarity is a good ideaFor whole program matching / method matching with code transformations•Semantic based is more appropriateScalability •PDG < CFG | AST < STRING <
View Full Document