Tree Partition based Parallel Frequent Pattern

Home> Academic Documents> Tree Partition based Parallel Frequent Pattern

DOC PREVIEW

This preview shows page 1-2-3 out of 8 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Tree Partition based Parallel Frequent Pattern mining on SharedMemory SystemsDehao Chen1, Chunrong Lai2, Wei Hu2,WenGuang Chen1,Yimin Zhang2, and Weimin Zheng11Tsinghua University2Intel CorporationDept. of Computer Science Intel China Research CenterBeijing, China 100084 No.2 Kexueyuan South [email protected] Beijing, China 100080{cwg, zwm-dcs}@tsinghua.edu.cn {chunrong.lai,wei.hu,yimin.zhang}@intel.comAbstractIn this paper, we present a tree-partition algorithmfor parallel mining of frequent patterns. Our work isbased on FP-Growth algorithm, which is constituted oftree-building stage and mining stage. The main idea isto build only one FP-Tree in the memory, partition itinto several independent parts and distribute them todiﬀerent threads. A heuristic algorithm is devised tobalance the workload. Our algorithm can not only alle-viate the impact of locks during the tree-building stage,but also avoid the overhead that do great harm to themining stage. We present the experiments on diﬀerentkinds of datasets and compare the results with otherparallel approaches. The results suggest that our ap-proach has great advantage in eﬃciency, especially oncertain kinds of datasets. As the number of processorsincreases, our parallel algorithm shows good scalability.1. IntroductionAssociation rule mining searches for interesting re-lationships among items in a given data set. One ofthe most famous examples of association rule miningis the market basket problem, which was introduced in[1]. Since the invention of Apriori algorithm [1, 2], alot of algorithms have been devised to cope with theproblem of frequent pattern mining, which is the mosttime-consuming part of association rule mining pro-cess. However, as for extremely large datasets, thecurrently proposed frequent pattern mining algorithmsstill consume too much time. One solution is to designmore eﬃcient mining algorithms to reduce the repeatedI/O scans as well as to minimize the memory require-ment and calculating time. So algorithms like kDCI [3],FP-Growth [4], etc, are proposed. Another alternativesolution is to parallelize the algorithm.As the recent development of CMP and SMP ar-chitecture, the shared-memory parallel program is be-coming more and more popular. The shared mem-ory structure can provide an economic parallel solutionwith good eﬃciency and high scalability. The presentfrequent pattern mining algorithms can be divided intotwo categories: apriori-like algorithms, which are theimplementations of the classical apriori algorithm, andthe other ones, which are completely diﬀerent in struc-ture with the classical apriori algorithm. Among bothkinds of algorithms, FP-Growth [4] can achieve goodeﬃciency. It’s faster than any of the apriori-like al-gorithms and it only has to scan the whole databasetwice.This paper proposed an approach to parallelize theFP-Growth algorithm on shared-memory structures.Section 2 gives a brief introduction to FP-Growth Algo-rithm. Section 3 presents related parallelization workof FP-Growth. Section 4 and Section 5 introduced ourparallelization algorithm. Section 6 showed the experi-ment results as well as comparisons with other parallelalgorithms.2. FP-Growth AlgorithmFP-Growth algorithm is based on tree structures.The algorithm can be divided into two steps.1-4244-0054-6/06/$20.00 ©2006 IEEE2.1. Building FP-TreeScan the database twice. The ﬁrst scan gathers nec-essary information for the second scan to build the FP-Tree from each line of transaction. The algorithm isshown below:Algorithm 1 FP-tree construction.Input: A transaction database DB and a minimumsupport threshold ξ.Output: FP-tree, the frequent-pattern tree of DB.Method: The FP-tree is constructed as follows.1. Scan the transaction database DB once. CollectF, the set of frequent items, and the support ofeach frequent item. Sort F in support-descendingorder as FList, the list of frequent items.2. Create the root of an FP-tree, T , and label it as“null”. For each transaction Trans in DB, do thefollowing.Select the frequent items in Trans and sort them ac-cording to the order of FList. Let the sorted frequent-item list in Trans be [p|P], where p is the ﬁrst elementand P is the remaining list. Call inserttree([p|P ],T).The function inserttree([p|P ],T) is performed asfollows. If T has a child N such that N.itemname =p.itemname, then increment N’s count by 1; else createa new node N, with its count initialized to 1, its parentlink linked to T , and its nodelink linked to the nodeswith the same itemname via the node link structure.If P is nonempty, call inserttree(P, N) recursively.2.2. Mining from the F P-TreeIt’s an iterative procedure: each step produces aset of conditional pattern base and then calculated to-gether. The algorithm is shown below:Algorithm 2 FP-growth: Mining frequent patternswith FP-tree by pattern fragment growth.Input: A database DB, represented by FP-tree con-structed according to Algorithm 1 , and a mini-mum support threshold ξ.Output: The complete set of frequent patterns.Method: Call FPGrowth(FP tree, null), which isshown in Figure 1.3. Related WorkAmong those algorithms proposed to address theproblem of mining frequent patterns, one of the keyalgorithms, which seemed to be the most popular inmany applications for enumerating frequent itemsets,is the apriori algorithm [1, 2]. It’s the foundation ofmost known algorithms whether sequential or parallel.Salvatore et al. have proposed Direct Count & Inter-sect (kDCI) [3] as well as its parallel version (parDCI[5]). However, it requires at least 3 full database scans.FP-Growth [4], which was proposed by Han et al., cre-ates a relatively compact tree-structure that alleviatesthe multi-scan problem and improved the candidateitemset generation. The algorithm requires only 2 fulldatabase scans. Our approach presented in this pa-per was based on this idea. In spite of the signiﬁcanceof the frequent pattern mining, particularly the gen-eration of frequent itemsets, few advances have beenmade on parallelizing frequent pattern mining algo-rithms. Most of the work on parallelizing associationrule mining on Shared-memory Multi Processor archi-tecture was based on apriori-like algorithms [5, 6, 7].Osmar et al. proposed a parallel algorithm [8], whichbuild extra trees for each process. Each thread buildsits own FP-Tree from certain part of the database, cal-culates out the candidate pattern base from its ownFP-Tree and then merges the


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 8 pages.

Please select your school