Tuning StencilsStencil Code OverviewStencil ApplicationsNaïve Stencil CodeOur Stencil CodeCache-Based ArchitecturesAutotuningNaive CodeNaïveNUMA-AwareSlide 11Loop Unrolling/ReorderingSlide 13PaddingSlide 15Thread/Cache BlockingSlide 17Software PrefetchingSlide 19SIMDizationSlide 21Cache BypassSlide 23Collaborative ThreadingSlide 25Autotuning ResultsArchitecture ComparisonConclusionsAcknowledgementsP A R A L L E L C O M P U T I N G L A B O R A T O R YEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABTuning StencilsKaushik DattaMicrosoft Site VisitApril 29, 2008EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABStencil Code OverviewFor a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.2D Stencil 3D StencilEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABStencil ApplicationsStencils are critical to many scientific applications:Diffusion, Electromagnetics, Computational Fluid DynamicsBoth uniform and adaptive block-structured meshesMany type of stencils1D, 2D, 3D meshesNumber of neighbors (5-pt, 7-pt, 9-pt, 27-pt,…)Gauss-Seidel (update in place) vs Jacobi iterations (2 meshes)Varying boundary conditions (constant vs. periodic)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNaïve Stencil Codevoid stencil3d(double A[], double B[], int nx, int ny, int nz) {for all grid indices in x-dim { for all grid indices in y-dim { for all grid indices in z-dim { B[center] = S0* A[center] + S1*(A[top] + A[bottom] + A[left] + A[right] + A[front] + A[back]); } } }}EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABOur Stencil CodeExecutes a 3D, 7-point, Jacobi iteration on a 2563 gridPerforms 8 flops (6 adds, 2 mults) per pointParallelization performed with pthreadsThread affinity: multithreading, then multicore, then multisocketFlop:Byte Ratio0.33 (write allocate architectures)0.5 (Ideal)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABCache-Based ArchitecturesIntel ClovertownSun Victoria FallsAMD BarcelonaQuickTime™ and a decom pressorare needed to see t his pic ture.QuickTime™ and a decom pressorare needed to see t his pic ture.667MHz DDR2 DIMMs10.66 GB/sQuickTime™ and a decompressorare needed to see this pict ure.2x64b memory controllersQuickTime™ and a decompressorare needed to see this pict ure.HyperTransportQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.QuickTime™ and a decompressorare needed to see this pict ure.667MHz DDR2 DIMMs10.66 GB/sQuickTime™ and a decompressorare needed to see this pict ure.2x64b memory controllersQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.2MB Shared quasi- victim (32 way)QuickTime™ and a decompressorare needed to see this pict ure.SRI / c rossbarQuickTime™ and a decompressorare needed to see this pict ure.2MB Shared quasi- victim (32 way)QuickTime™ and a decompressorare needed to see this pict ure.SRI / c rossbarQuickTime™ and a decompressorare needed to see this pict ure.HyperTransport4GB/s (each direction)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABAutotuningProvides a portable and effective method for tuningLimiting the search space:Searching the entire space is intractableInstead, we ordered the optimizations appropriately for a given platformTo find best parameters for a given optimization, performed exhaustive searchEach optimization was applied on top of all previous optimizationsIn general, can also use heuristics/models to prune search spaceEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNaive CodeNaïve code is a simple, threaded stencil kernelDomain partitioning performed only in least contiguous dimensionNo optimizations or tuning was performedxyz (unit-stride)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNaïveIntel Clovertown AMD BarcelonaSun Victoria Falls01234567891 2 4 8GFlops/s01234567891 2 4 8GFlops/s01234567891 2 4 8 16GFlops/sNaiveEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNUMA-AwareIntel ClovertownSun Victoria FallsAMD BarcelonaQuickTime™ and a decompressorare needed to see this picture.QuickTime™ and a decompressorare needed to see this picture.667MHz DDR2 DIMMs10.66 GB/sQuickTime™ and a decompressorare needed to see this picture.2x64b memory controllersQuickTime™ and a decompressorare needed to see this picture.HyperTransportQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.QuickTime™ and a decompressorare needed to see this picture.667MHz DDR2 DIMMs10.66
or
We will never post anything without your permission.
Don't have an account? Sign up