Berkeley COMPSCI C267 - Auto-tuning Memory Intensive Kernels for Multicore

Slide titles (full deck): Auto-tuning Memory Intensive Kernels for Multicore; Outline; Challenges arising from Optimizing Single Thread Performance; Instruction-Level Parallelism; ILP Example (1x1 BCSR); ILP Example (1x4 BCSR); ILP Example (4x1 BCSR); Data-level Parallelism; Memory-Level Parallelism (1); Memory-Level Parallelism (2); Branch Misprediction; Cache Subtleties; Array padding Example; New Challenges Arising when Optimizing Multicore SMP Performance; What are SMPs?; What is multicore? What are multicore SMPs?; Affinity; NUMA Challenges; Implicit allocation for NUMA; New Cache Challenges; New Memory Challenges; Synchronization; Performance Modeling and Little's Law; System Abstraction; Little's Law; Where's the bottleneck?; Arithmetic Intensity; Kernel Arithmetic Intensity and Architecture; Roofline Model Basic Concept; Roofline Model computational ceilings; Roofline Model communication ceilings; Roofline Model computation + communication ceilings; Roofline Model locality walls; Optimization Categorization; Auto-tuning?; Multicore SMPs of Interest; Multicore SMPs Used (conventional cache-based memory hierarchy; local store-based memory hierarchy; CMT = Chip-MultiThreading; threads; peak double precision flops; total DRAM bandwidth; Non-Uniform Memory Access, NUMA); Roofline Model for these multicore SMPs; Auto-tuning Sparse Matrix-Vector Multiplication (SpMV); Sparse Matrix Vector Multiplication; The Dataset (matrices); SpMV Performance (simple parallelization); NUMA for SpMV; Prefetch for SpMV; SpMV Performance (NUMA and Software Prefetching); ILP/DLP vs Bandwidth; SpMV Performance (Matrix Compression); Cache blocking for SpMV; Auto-tuned SpMV Performance (cache and TLB blocking; architecture specific optimizations; max speedup); Roofline model for SpMV (overlay arithmetic intensity; out-of-the-box parallel; NUMA & SW prefetch; matrix compression); Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD); LBMHD; LBMHD Performance (reference implementation; lattice-aware array padding); Vectorization; LBMHD Performance (vectorization; architecture specific optimizations); Roofline model for LBMHD (overlay arithmetic intensity; out-of-the-box parallel performance; Padding, Vectorization, Unrolling, Reordering; SIMDization + cache bypass); LBMHD Performance (Summary); Summary; Acknowledgements; Questions?

FUTURE TECHNOLOGIES GROUP

Auto-tuning Memory Intensive Kernels for Multicore
Sam Williams
[email protected]

Outline
1. Challenges arising from Optimizing Single Thread Performance
2. New Challenges Arising when Optimizing Multicore SMP Performance
3. Performance Modeling and Little's Law
4. Multicore SMPs of Interest
5. Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
6. Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
7. Summary

Challenges arising from Optimizing Single Thread Performance

Instruction-Level Parallelism
On modern pipelined architectures, operations such as floating-point addition have a latency of 4-6 cycles before the result is ready. However, independent adds can be pipelined one after another. Although pipelining raises the peak flop rate, a program can reach peak flops only if, on any given cycle, it has more than four independent adds ready to execute; failing to do so results in a greater than 4x drop in performance. The problem is exacerbated on superscalar and VLIW architectures such as POWER and Itanium. One must often reorganize kernels to express more instruction-level parallelism.

ILP Example (1x1 BCSR)
Consider the core of SpMV with a 1x1 register block:

    for (all rows) {
      y0 = 0.0;
      for (all tiles in this row) {
        y0 += V[i] * X[C[i]];
      }
      y[r] = y0;
    }

There is no ILP in the inner loop: every FMA accumulates into y0, and out-of-order execution cannot accelerate a serial chain of FMAs. One FMA completes only every latency period (time = 0, 4, 8, 12, 16, ...).

ILP Example (1x4 BCSR)
What about 1x4 BCSR?

    for (all rows) {
      y0 = 0.0;
      for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]  ];
        y0 += V[i+1] * X[C[i]+1];
        y0 += V[i+2] * X[C[i]+2];
        y0 += V[i+3] * X[C[i]+3];
      }
      y[r] = y0;
    }

Still no ILP in the inner loop: the FMAs all update the same accumulator and remain dependent on each other (time = 0, 4, 8, 12, ...).

ILP Example (4x1 BCSR)
What about 4x1 BCSR?

    for (all rows) {
      y0 = 0.0; y1 = 0.0;
      y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]];
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r+0] = y0;
      y[r+1] = y1;
      y[r+2] = y2;
      y[r+3] = y3;
    }

This kernel updates four different rows, so the four FMAs are independent and can be pipelined: one FMA can issue every cycle (time = 0, 1, 2, 3, ...).

Data-level Parallelism
DLP means applying the same operation to multiple independent operands. Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance (SSE, Double Hummer, AltiVec, Cell, GPUs, etc.). Typically these instructions operate on four single-precision (or two double-precision) numbers at a time, though some are wider: GPUs (32), Larrabee (16), and AVX (8). Failing to use these instructions may cause a 2-32x drop in performance. Unfortunately, most compilers utterly fail to generate these instructions.

Memory-Level Parallelism (1)
Although caches may filter many memory requests, in HPC many memory references still go all the way to DRAM. Memory latency (as measured in core cycles) grew by an order of magnitude in the 1990s; today the latency of a memory operation can exceed 200 cycles, and one double every 80 ns is unacceptably slow. Like ILP, we wish to pipeline requests to DRAM. Several solutions exist today:
- HW stream prefetchers
- HW multithreading (e.g. hyperthreading)
- SW line prefetch
- DMA

