DOC PREVIEW
Tuning Stencils

This preview shows page 1-2-3-27-28-29 out of 29 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Tuning StencilsStencil Code OverviewStencil ApplicationsNaïve Stencil CodeOur Stencil CodeCache-Based ArchitecturesAutotuningNaive CodeNaïveNUMA-AwareSlide 11Loop Unrolling/ReorderingSlide 13PaddingSlide 15Thread/Cache BlockingSlide 17Software PrefetchingSlide 19SIMDizationSlide 21Cache BypassSlide 23Collaborative ThreadingSlide 25Autotuning ResultsArchitecture ComparisonConclusionsAcknowledgementsP A R A L L E L C O M P U T I N G L A B O R A T O R YEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABTuning StencilsKaushik DattaMicrosoft Site VisitApril 29, 2008EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABStencil Code OverviewFor a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.2D Stencil 3D StencilEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABStencil ApplicationsStencils are critical to many scientific applications:Diffusion, Electromagnetics, Computational Fluid DynamicsBoth uniform and adaptive block-structured meshesMany type of stencils1D, 2D, 3D meshesNumber of neighbors (5-pt, 7-pt, 9-pt, 27-pt,…)Gauss-Seidel (update in place) vs Jacobi iterations (2 meshes)Varying boundary conditions (constant vs. periodic)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNaïve Stencil Codevoid stencil3d(double A[], double B[], int nx, int ny, int nz) {for all grid indices in x-dim { for all grid indices in y-dim { for all grid indices in z-dim { B[center] = S0* A[center] + S1*(A[top] + A[bottom] + A[left] + A[right] + A[front] + A[back]); } } }}EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABOur Stencil CodeExecutes a 3D, 7-point, Jacobi iteration on a 2563 gridPerforms 8 flops (6 adds, 2 mults) per pointParallelization performed with pthreadsThread affinity: multithreading, then multicore, then multisocketFlop:Byte Ratio0.33 (write allocate architectures)0.5 (Ideal)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABCache-Based ArchitecturesIntel ClovertownSun Victoria FallsAMD BarcelonaQuickTime™ and a decom pressorare needed to see t his pic ture.QuickTime™ and a decom pressorare needed to see t his pic ture.667MHz DDR2 DIMMs10.66 GB/sQuickTime™ and a decompressorare needed to see this pict ure.2x64b memory controllersQuickTime™ and a decompressorare needed to see this pict ure.HyperTransportQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.QuickTime™ and a decompressorare needed to see this pict ure.667MHz DDR2 DIMMs10.66 GB/sQuickTime™ and a decompressorare needed to see this pict ure.2x64b memory controllersQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.OpteronQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.512KBvictimQuickTime™ and a decompressorare needed to see this pict ure.2MB Shared quasi- victim (32 way)QuickTime™ and a decompressorare needed to see this pict ure.SRI / c rossbarQuickTime™ and a decompressorare needed to see this pict ure.2MB Shared quasi- victim (32 way)QuickTime™ and a decompressorare needed to see this pict ure.SRI / c rossbarQuickTime™ and a decompressorare needed to see this pict ure.HyperTransport4GB/s (each direction)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABAutotuningProvides a portable and effective method for tuningLimiting the search space:Searching the entire space is intractableInstead, we ordered the optimizations appropriately for a given platformTo find best parameters for a given optimization, performed exhaustive searchEach optimization was applied on top of all previous optimizationsIn general, can also use heuristics/models to prune search spaceEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNaive CodeNaïve code is a simple, threaded stencil kernelDomain partitioning performed only in least contiguous dimensionNo optimizations or tuning was performedxyz (unit-stride)EECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNaïveIntel Clovertown AMD BarcelonaSun Victoria Falls01234567891 2 4 8GFlops/s01234567891 2 4 8GFlops/s01234567891 2 4 8 16GFlops/sNaiveEECSElectrical Engineering andComputer SciencesBERKELEY PAR LABNUMA-AwareIntel ClovertownSun Victoria FallsAMD BarcelonaQuickTime™ and a decompressorare needed to see this picture.QuickTime™ and a decompressorare needed to see this picture.667MHz DDR2 DIMMs10.66 GB/sQuickTime™ and a decompressorare needed to see this picture.2x64b memory controllersQuickTime™ and a decompressorare needed to see this picture.HyperTransportQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.OpteronQuickTime™ and a decompressorare needed to see this picture.QuickTime™ and a decompressorare needed to see this picture.667MHz DDR2 DIMMs10.66


Tuning Stencils

Download Tuning Stencils
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Tuning Stencils and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Tuning Stencils 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?