DOC PREVIEW
Comprehensive Incremental Mining Algorithms

This preview shows page 1-2-3-4 out of 12 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CISpan: Comprehensive Incremental Mining Algorithms of ClosedSequential Patterns for Multi-Versional Software MiningDing Yuan†, Kyuhyung Lee†, Hong Cheng†, Gopal Krishna†, Zhenmin Li‡,Xiao Ma†, Yuanyuan Zhou†‡and Jiawei Han††University of Illinois at Urbana-Champaign, Urbana, Illinois, USA‡CleanMake Inc., Urbana, Illinois, USA{dyuan3, kyuhlee, hcheng3, gkrishn2, zli4, xiaoma2, yyzhou, hanj}@cs.uiuc.eduAbstractRecently, frequent sequential pattern mining algorithms havebeen widely used in software engineering field to minevarious source code or specification patterns. In practice,software evolves from one version to another in its life span.The effort of mining frequent sequential patterns acrossmultiple versions of a software can be substantially reducedby efficient incremental mining. This problem is challengingin this domain since the databases are usually updated in allkinds of manners including insertion, various modificationsas well as removal of sequences. Also, different mining toolsmay have various mining constraints, such as low minimumsupport. None of the existing work can be applied effectivelydue to various limitations of such work. For example, ourrecent work, IncSpan, failed solving the problem because itcould neither handle low minimum support nor removal ofsequences from database.In this paper, we propose a novel, comprehensiveincremental mining algorithm for frequent sequentialpattern, CISpan (Comprehensive Incremental SequentialPattern mining). CISpan supports both closed and completeincremental frequent sequence mining, with all kinds ofupdates to the database. Compared to IncSpan, CISpantolerates a wide range for minimum support threshold (aslow as 2). Our performance study shows that in addition tohandling more test cases on which IncSpan fails, CISpanoutperforms IncSpan in all test cases which IncSpan couldhandle, including various sequence length, number ofsequences, modification ratio, etc., with an average of 3.4times speedup. We also tested CISpan’s performance ondatabases transformed from 20 consecutive versions ofLinux Kernel source code. On average, CISpan outperformsthe non-incremental CloSpan by 42 times.Keywords: Incremental mining, Software Engineering,Cross Module Mining, Frequent pattern1 Introduction1.1 Motivation Frequent sequential pattern mining [16,13, 12, 15] is an important and active research topic indata mining with broad applications, including mining weblogs, customer shopping transaction analysis, and DNAsequences, etc.These years also saw an increasing trend of utilizing fre-quent pattern mining in source code mining [5, 7, 14, 1, 9, 6]and software specification mining [8]. These tools tokenizethe source code in certain ways into a sequence database rep-resentation, and mine the frequent patterns in order to extractvarious information such as copy-pasted code segments [5],API usage [14, 1], programming rules [6], etc. For example,CP-Miner [5] is a tool to effectively detect copy-pasted codesegments and copy-paste related bugs from source code. Itfirst tokenizes each statement of source code into an integer,and maps the code into a sequential database, where eachsequence corresponds to a basic block of the source code.By mining the closed frequent sequences with the minimumsupport, min sup, of 2, which suggests that the code seg-ment was at least repeated once, the tool detects all the copy-pasted code segments within the source code.In reality, database evolves in an incremental manner.For example, the latest major version of Linux Kernel,2.6.X series, consists of more than 150 versions in total,which would be transformed into more than 150 similardatabases. The difference between these databases couldinvolve removal, various kinds of modifications and insertionof sequences, corresponding to removal, modification andinsertion of code segments into the source code. If we mineeach updated database from scratch, the majority of time andspace would be spent on repetition of the previous miningprocess, which is something we should and could optimize.In our result, we will demonstrate that by using our CISpanalgorithm, we can reduce the mining time of the updatedversion database transformed from Linux Kernel source codeby 42 times on average.Because of the significance of this problem, manystudies have contributed to making sequential pattern min-ing incremental [4, 10, 17, 3]. Our recent work, Inc-Span [4], demonstrated it significantly outperforms the non-incremental sequential mining algorithm [12] and the previ-ously proposed incremental algothim [10], and thus is con-sidered the state-of-the-art of incremental sequential patternmining. When mining the original database, IncSpan notonly buffers the frequent sequences, but also sequences thatare semi-frequent, which are likely to become frequent inthe new version. Later when mining the new version, onlysequences with support over a certain threshold will likelybecome frequent out of un-buffered sequences.However, during our attempts to apply IncSpan intothe code mining tools, we found that IncSpan is neithergeneral enough to handle real life database evolution nor reallife mining requirement (for example, CP-Miner requiresminimum support threshold to be 2). In addition, theperformance of IncSpan is far from optimal. The limitationand cost of IncSpan can be summarized as follows.• Failure to handle removal of sequences. IncSpancould only handle insertion of sequences or appendingitems into the tail of each sequence. The databasesin real world would evolve in far more varied ways.For example, the source code database we encounteredwould involve removal, various kinds of modificationssuch as adding or removing items, as well as insertionof the sequences.• Difficulties for mining frequent sequences with lowminimum support. IncSpan buffers semi-frequentsequences, sequences with support between [µ ×min sup, min sup) where µ is the buffer ratio be-tween 0 and 1. When the min sup itself is small, Inc-Span would end up buffering huge amount of semi-frequent sequences, and significantly degrade the per-formance. In the CP-Miner context, the min sup is 2.In this extreme case, IncSpan will not be effective atall, since it has to buffer semi-frequent sequences withsupport 1, that are all the subsequences in the database(exponential of sequence number in the database).• Unable to mine closed sequences. In many context,we only care about closed sequences. Given the


Comprehensive Incremental Mining Algorithms

Download Comprehensive Incremental Mining Algorithms
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Comprehensive Incremental Mining Algorithms and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Comprehensive Incremental Mining Algorithms 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?