DOC PREVIEW
Tree Partition based Parallel Frequent Pattern

This preview shows page 1-2-3 out of 8 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Tree Partition based Parallel Frequent Pattern mining on SharedMemory SystemsDehao Chen1, Chunrong Lai2, Wei Hu2,WenGuang Chen1,Yimin Zhang2, and Weimin Zheng11Tsinghua University2Intel CorporationDept. of Computer Science Intel China Research CenterBeijing, China 100084 No.2 Kexueyuan South [email protected] Beijing, China 100080{cwg, zwm-dcs}@tsinghua.edu.cn {chunrong.lai,wei.hu,yimin.zhang}@intel.comAbstractIn this paper, we present a tree-partition algorithmfor parallel mining of frequent patterns. Our work isbased on FP-Growth algorithm, which is constituted oftree-building stage and mining stage. The main idea isto build only one FP-Tree in the memory, partition itinto several independent parts and distribute them todifferent threads. A heuristic algorithm is devised tobalance the workload. Our algorithm can not only alle-viate the impact of locks during the tree-building stage,but also avoid the overhead that do great harm to themining stage. We present the experiments on differentkinds of datasets and compare the results with otherparallel approaches. The results suggest that our ap-proach has great advantage in efficiency, especially oncertain kinds of datasets. As the number of processorsincreases, our parallel algorithm shows good scalability.1. IntroductionAssociation rule mining searches for interesting re-lationships among items in a given data set. One ofthe most famous examples of association rule miningis the market basket problem, which was introduced in[1]. Since the invention of Apriori algorithm [1, 2], alot of algorithms have been devised to cope with theproblem of frequent pattern mining, which is the mosttime-consuming part of association rule mining pro-cess. However, as for extremely large datasets, thecurrently proposed frequent pattern mining algorithmsstill consume too much time. One solution is to designmore efficient mining algorithms to reduce the repeatedI/O scans as well as to minimize the memory require-ment and calculating time. So algorithms like kDCI [3],FP-Growth [4], etc, are proposed. Another alternativesolution is to parallelize the algorithm.As the recent development of CMP and SMP ar-chitecture, the shared-memory parallel program is be-coming more and more popular. The shared mem-ory structure can provide an economic parallel solutionwith good efficiency and high scalability. The presentfrequent pattern mining algorithms can be divided intotwo categories: apriori-like algorithms, which are theimplementations of the classical apriori algorithm, andthe other ones, which are completely different in struc-ture with the classical apriori algorithm. Among bothkinds of algorithms, FP-Growth [4] can achieve goodefficiency. It’s faster than any of the apriori-like al-gorithms and it only has to scan the whole databasetwice.This paper proposed an approach to parallelize theFP-Growth algorithm on shared-memory structures.Section 2 gives a brief introduction to FP-Growth Algo-rithm. Section 3 presents related parallelization workof FP-Growth. Section 4 and Section 5 introduced ourparallelization algorithm. Section 6 showed the experi-ment results as well as comparisons with other parallelalgorithms.2. FP-Growth AlgorithmFP-Growth algorithm is based on tree structures.The algorithm can be divided into two steps.1-4244-0054-6/06/$20.00 ©2006 IEEE2.1. Building FP-TreeScan the database twice. The first scan gathers nec-essary information for the second scan to build the FP-Tree from each line of transaction. The algorithm isshown below:Algorithm 1 FP-tree construction.Input: A transaction database DB and a minimumsupport threshold ξ.Output: FP-tree, the frequent-pattern tree of DB.Method: The FP-tree is constructed as follows.1. Scan the transaction database DB once. CollectF, the set of frequent items, and the support ofeach frequent item. Sort F in support-descendingorder as FList, the list of frequent items.2. Create the root of an FP-tree, T , and label it as“null”. For each transaction Trans in DB, do thefollowing.Select the frequent items in Trans and sort them ac-cording to the order of FList. Let the sorted frequent-item list in Trans be [p|P], where p is the first elementand P is the remaining list. Call inserttree([p|P ],T).The function inserttree([p|P ],T) is performed asfollows. If T has a child N such that N.itemname =p.itemname, then increment N’s count by 1; else createa new node N, with its count initialized to 1, its parentlink linked to T , and its nodelink linked to the nodeswith the same itemname via the node link structure.If P is nonempty, call inserttree(P, N) recursively.2.2. Mining from the F P-TreeIt’s an iterative procedure: each step produces aset of conditional pattern base and then calculated to-gether. The algorithm is shown below:Algorithm 2 FP-growth: Mining frequent patternswith FP-tree by pattern fragment growth.Input: A database DB, represented by FP-tree con-structed according to Algorithm 1 , and a mini-mum support threshold ξ.Output: The complete set of frequent patterns.Method: Call FPGrowth(FP tree, null), which isshown in Figure 1.3. Related WorkAmong those algorithms proposed to address theproblem of mining frequent patterns, one of the keyalgorithms, which seemed to be the most popular inmany applications for enumerating frequent itemsets,is the apriori algorithm [1, 2]. It’s the foundation ofmost known algorithms whether sequential or parallel.Salvatore et al. have proposed Direct Count & Inter-sect (kDCI) [3] as well as its parallel version (parDCI[5]). However, it requires at least 3 full database scans.FP-Growth [4], which was proposed by Han et al., cre-ates a relatively compact tree-structure that alleviatesthe multi-scan problem and improved the candidateitemset generation. The algorithm requires only 2 fulldatabase scans. Our approach presented in this pa-per was based on this idea. In spite of the significanceof the frequent pattern mining, particularly the gen-eration of frequent itemsets, few advances have beenmade on parallelizing frequent pattern mining algo-rithms. Most of the work on parallelizing associationrule mining on Shared-memory Multi Processor archi-tecture was based on apriori-like algorithms [5, 6, 7].Osmar et al. proposed a parallel algorithm [8], whichbuild extra trees for each process. Each thread buildsits own FP-Tree from certain part of the database, cal-culates out the candidate pattern base from its ownFP-Tree and then merges the


Tree Partition based Parallel Frequent Pattern

Download Tree Partition based Parallel Frequent Pattern
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Tree Partition based Parallel Frequent Pattern and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Tree Partition based Parallel Frequent Pattern 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?