A Characterization of Data Mining Algorithms on a Modern Processor

Home> Academic Documents> A Characterization of Data Mining Algorithms on a Modern Processor

DOC PREVIEW

This preview shows page 1-2-15-16-31-32 out of 32 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 32 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

A Characterization of Data Mining Algorithms on a Modern ProcessorAmol Ghoting, Gregory Buehrer, and Srinivasan ParthasarathyData Mining Research Laboratory,The Ohio State UniversityDaehyun Kim, Anthony Nguyen, Yen-Kuang Chen, and Pradeep DubeyArchitecture Research Laboratory, Intel CorporationRoadmap• Motivation and Contributions• Algorithms under study• Performance characterization• Case study:– Improving performance of FP-Growth• Related work• ConclusionsMotivation• KDD applications constitute a rapidly growing segment of the commercial and scientific computing domains• Interactive process  response times• Memory and compute intensive• Modern architectures– Memory wall issues• Latency tolerating mechanisms – prefetching, SMT• Objective here is to characterize such applications on a modern architecture– Can we leverage above mechanisms effectively?Contributions• Specifically, we study– Performance and memory access behavior of eight data mining algorithms– Impact of processor technologies such as hardware pre-fetching and simultaneous multithreading (SMT)– How to leverage latency-tolerating mechanisms to improve performance of frequent pattern miningRoadmap• Motivation and Contributions• Algorithms under study• Performance characterization• Case study:– Improving performance of FP-Growth• Related work• ConclusionsAlgorithms under study (1)• Frequent itemset mining– Finds groups of items that co-occur frequently in a transactional data set– Example: “Item A and Item B are purchased together 90% of the time”• FPGrowth (FP-tree)• MAFIA (Tid-list as a bit vector)• Sequence mining– Discovers sets of items that are shared across time– Example: “70% of the customers who buy item A also buy item B within 1 month”• SPADE (Tid-list)Algorithms under study (2)• Graph mining– Finds frequent sub-graphs in a graph data set• FSG– Tid-list• Clustering– Partitions data points into groups or clusters such that intra-cluster distance in minimized and inter-cluster distance in maximized• kMeans and vClusterAlgorithms under study (3)• Outlier detection– Finds the top k points in a data set that are most different from the remaining points• ORCA• Decision tree induction– Learns a decision tree from a data set• C4.5Roadmap• Motivation and Contributions• Algorithms under study• Performance characterization• Case study:– Improving performance of FP-Growth• Related work• ConclusionsPerformance characterization• Setup– Intel P4 at 2.8GHz with HT technology– 1.5GB of main memory– 8KB L1 d-cache and 512 KB L2 u-cache– Intel VTune Performance Analyzers to collect performance characteristics of execution– Synthetic/Real datasets– All codes were obtained from the authorsOperation mix0.1560.1630.1540.0570.1660.1770.0740.136Branch operations / instruction0.3920.5170.3530.2670.6920.5930.4330.65Memory operations / instruction0.2730.0870.2070.2520.0150.0040.0010.001Floating point operations / instruction0.6070.7690.6200.6880.6250.6360.8220.56Integer ALU operations / instructionORCAC4.5vClusterkMeansFSGSPADEMAFIAFPGrowthCache and CPU performance• FPGrowth– Poor cache hit rates– Large number of DTLB misses per instruction– Poor data locality – Low ILP0.119CPU utilization0.000ITLB misses / instruction0.024DTLB misses / instruction0.135ST operations / instruction0.515LD operations / instruction0.03L2 LD misses / instruction0.430L2 LD hit rate0.891L1 LD hit rateCache and CPU performance• MAFIA– Has the highest CPU utilization of the considered workloads• Counting using bit-vectors is very efficient– Temporal locality can be improved• Note:– The search is not as efficient as FPGrowth0.446CPU utilization0.000ITLB misses / instruction0.000DTLB misses / instruction0.042ST operations / instruction0.391LD operations / instruction0.001L2 LD misses / instruction0.997L2 LD hit rate0.953L1 LD hit rateCache and CPU performance• SPADE– Temporal locality can be improved– Very poor CPU utilization• Tidlist joins are expensive0.146CPU utilization0.000ITLB misses / instruction0.012DTLB misses / instruction0.116ST operations / instruction0.538LD operations / instruction0.001L2 LD misses / instruction0.992L2 LD hit rate0.954L1 LD hit rateCache and CPU performance• FSG– Temporal locality can be improved– Very poor CPU utilization• Tidlist joins are expensive0.152CPU utilization0.000ITLB misses / instruction0.007DTLB misses / instruction0.160ST operations / instruction0.532LD operations / instruction0.002L2 LD misses / instruction0.985L2 LD hit rate0.963L1 LD hit rateCache and CPU performance• kMeans– Poor CPU utilization• FPU intensive0.244CPU utilization0.001ITLB misses / instruction0.001DTLB misses / instruction0.013ST operations / instruction0.254LD operations / instruction0.000L2 LD misses / instruction0.989L2 LD hit rate0.979L1 LD hit rateCache and CPU performance• vCluster– Poor data locality• Graph partitioning 0.322CPU utilization0.000ITLB misses / instruction0.001DTLB misses / instruction0.083ST operations / instruction0.279LD operations / instruction0.000L2 LD misses / instruction0.987L2 LD hit rate0.882L1 LD hit rateCache and CPU performance• C4.5– The sort routine has poor data locality• Cache-conscious sort?0.049CPU utilization0.000ITLB misses / instruction0.005DTLB misses / instruction0.131ST operations / instruction0.385LD operations / instruction0.031L2 LD misses / instruction0.969L2 LD hit rate0.60L1 LD hit rateCache and CPU performance• ORCA– Similar trends0.316CPU utilization0.000ITLB misses / instruction0.003DTLB misses / instruction0.057ST operations / instruction0.335LD operations / instruction0.000L2 LD misses / instruction0.993L2 LD hit rate0.970L1 LD hit rateImpact of hardware prefetching and SMT1.031.181.261.301.261.051.061.02Speedup due to SMT1.011.061.191.021.151.021.651.11Speedup due to hardware pre-fetchingORCAC4.5vClusterkMeansFSGSPADEMAFIAFPGrowth• Prefetching improves performance for MAFIA, significantly– AND operation on bit-vectors– Working set is larger than other frequent pattern mining workloads• SMT helps the FPU intensive workloads, as it is able to mask FPUlatency– Not easy to hide memory latencyCharacterization summary• Compute intensive– Integer ALU and FPU intensive• Memory intensive– Limits CPU utilization• Good spatial locality– Temporal locality can be


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-15-16-31-32 out of 32 pages.

Please select your school