1Improving Memory Hierarchy Performance for IrregularApplications Using Data and Computation ReorderingsBy: John Mellor-Crummey, David Whalley, Ken KennedyPresented By: Saeed AlaeiOutline Memory hierarchies and performance. Data and computation reordering in general. Regular vs. Irregular applications. Space filling curves: Hilbert curve Morton curve Data and computation reordering approaches. Experimental resultsMemory performance Gap between CPU speed and memory speed has increased rapidly. Multi-level memory hierarchies are used to address this memory access bottleneck. Irregular applications underutilize the memory hierarchies. Data and computation reordering can greatly help.Data reorderingfor(i=0;i<n;i++)for(j=0;j<n;j++)for(k=0;k<n;k++)C[i][j]+=A[i][k]*B[k][j];for(i=0;i<n;i++)for(j=0;j<n;j++)for(k=0;k<n;k++)C[i][j]+=A[i][k]*BT[j][k];Computation reorderingfor(i=0;i<n;i++)for(j=0;j<n;j++)for(k=0;k<n;k++)C[i][j]+=A[i][k]*B[k][j];for(i=0;i<n;i++)for(k=0;k<n;k++)for(j=0;j<n;j++)C[i][j]+=A[i][k]*B[k][j];Regular vs. Irregular Applications Regular applications: Static analysis and fixed access pattern. Techniques: Loop blocking. Data prefetching. Instances: Matrix multiplication/inversion.Irregular applications: Dynamic analysis, access pattern unknown until runtime. Proposed techniques: Dynamic data reordering before major computation phase. Computations reordering by blocking.2Irregular applications Poor spatial and temporal locality. Bandwidth and latency problems: Cache miss. TLB miss. Instances: N-Body simulation.N-BodySpace Filling Curve A space-filling curve for some finite space of d dimensions (d ≥ 2) is a continuous, non-smooth curve that passes arbitrarily close to every point. Each point in a d-dimensional space can be mapped to the nearest position along a 1-dimensional spacefilling curve by applying a sequence of bit-level logical operations to its d-dimensional coordinatesHilbert CurvesMorton CurvesX=1010101Y=1010010M=11001100011001Reordering Data Reordering Computation Reordering3Data Reordering Approaches Changes the order of the elements of data. Doesn’t change the order in which the elements. are referenced Approaches: First touch reordering. Space filling curve reordering.First Touch Data Reordering It is a greedy approach to improve spatial localities in irregular applications. Changes the order of particle information Improves the locality in time by reordering the particle list.First Touch Reordering (cont’d) Advantages: Simple to implement. Requires linear time. Disadvantages: Interaction list must be known in advance.Space Filling Curve Data Reordering Steps: Particle coordinates must be mapped to positions in space filling curve. Particles are sorted in ascending order by their position on thecurve. Advantages: Can be done prior to knowing the order of computation. Allows some computation reordering (if computed before the interaction list). Disadvantages: May require more overhead due to sorting.Space Filling Curve Data Reordering (cont’d) Increases the spatial locality.Computation Reordering Changes the order in which the elements. are referenced. Doesn’t Changes the order of the elements of data. Can improve both temporal and spatial locality of access. Approaches: Space filling curve reordering. Reordering by blocking.4Space Filling Curve Computation Reordering Changes the order of elements in the interaction list. The order of particle information remains unchanged. Improves temporal locality.Computation Reordering by BlockingComputation Reordering by Blocking (cont’d)01 2345PPPPPPPPxPPPPzPPyPPP0123 Blocking can also be done base on different approaches (e.g. by taking the address).Computation Reordering by Blocking (cont’d) Blocking can be extended to multiple levels. A blocking factor depends on architectural characteristics: Size of cache lines or page size for TLB. The number of sets in the cache. Associativity. Replacement policy. …Computation Reordering by Blocking (cont’d) The recursive divide and conquer approach can be used to achieve a machine independent blocking. Computation Reordering by Blocking (cont’d) For each pair [Pi, Pj] in the interaction list, we compute the interleaving of [block_of(Pi) , block_of(Pj)]. This is equivalent to using a Morton curve. A Hilbert curve can also be used. Both Hilbert and Morton curve give a recursive blocking.5Applying The Techniques Particle codes: Moldyn (a synthetic benchmark). Magi CHADMoldyn A synthetic benchmark for molecular dynamics simulation. Closely resembles the N-Body example. The time consuming part is the computeForcesfunction called on list of interactions. The experiment: Performed SGI O2 based on R10000 MIPS. 256,000 particles, over 27 million interactions.Moldyn – base line programBlock NumberInteraction NumberMoldyn – Hilbert CurveBlock NumberInteraction NumberMoldyn – Multi-level – first 10KBlock NumberInteraction NumberMoldyn – Multi-level – first 100KBlock NumberInteraction Number6Moldyn – Multi-level – first 1MBlock NumberInteraction NumberMoldyn – Various ReorderingsMagi An application used by the U.S. Air Force to perform hydrodynamic computation. Description:Magi – Various ReorderingsCHAD A Parallel unstructured mesh application for simulating 3D fluid flows with chemical reactions and fuel sprays. Contains mostly dense vector operations. The results indicate that the original ordering is already
View Full Document