Code Transformations to Improve Memory Parallelism
Vijay S. Pai and Sarita Adve
Presenter: Divya Kumar

Objectives
● To introduce code transformation techniques that reduce memory stall time
● Why read miss clustering?
● Why unroll-and-jam?
● What are the limiting factors of performance?
● What is the impact on application execution time?

Paper Outline
● Introduction
● Motivation for Read Miss Clustering
● Analysis and Transformation Framework
● Evaluation Methodology
● Experimental Results
● Conclusions and Future Work

Read Miss Clustering
● Why do we need to overlap multiple read misses?
– Read stall time is the bottleneck in ILP-based processors
– Read miss latencies are too long to overlap with other instruction types
● Why read miss clustering?
– Overlaps independent read misses within a single instruction window
● How do we implement it?
– Unroll-and-jam

Row-wise Traversal
    for (j = 0; j < 4; j++)
        for (i = 0; i < 4; i++)
            ... A[j][i] ...
● Spatial locality: consecutive inner iterations touch the same cache line, so misses are not clustered

Column-wise Traversal (Step One: Interchange)
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            ... A[j][i] ...
● Maximizes clustering
● Minimizes cache locality

Step Two: Strip-mine and Interchange
    for (jj = 0; jj < 4; jj += N)
        for (i = 0; i < 4; i++)
            for (j = jj; j < jj + N; j++)
                ... A[j][i] ...

Step Three: Unroll-and-Jam
    for (j = 0; j < 4; j += N)
        for (i = 0; i < 4; i++) {
            ... A[j][i] ...
            ... A[j+1][i] ...
            ...
            ... A[j+N-1][i] ...
        }
● Stop column-wise traversal when miss clustering resources are fully utilized

Benefits
● Enables scalar replacement
● Inner-loop iteration count does not change
● Clustering is possible in the postlude

Analysis
● Locality analysis determines leading references and inner-loop self-spatial locality (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. In Proc. of the Conf. on Programming Language Design and Implementation, 1991.)
● Limitations to read miss parallelism:
– Cache-line dependences
– Address dependences
– Window constraints

Cache-line Dependence
● Memory operation B is cache-line dependent on A if
– A is a leading reference, and
– a miss on A brings in the data of B
● Arises from leading references with inner-loop self-spatial locality (the inner-loop dependence distance)

Address Dependences
● Memory operation B is address dependent on A if
– the result of A is used to compute the address of B

Window Constraints
● A limit on the number of read misses that can be held simultaneously in the instruction window

Background on Floating-Point Pipelining
● Unroll-and-jam has previously been used to improve floating-point pipelining
● To fill the pipeline, unroll-and-jam must be applied until f >= l × ρ, where
– f: number of floating-point operations in the innermost loop
– l: number of stages in the floating-point pipeline
– ρ = R / ζ
– R: number of nodes in the inner-loop recurrence
– ζ: number of iterations to traverse the cycle

Mapping to Memory Parallelism
● l: maximum number of simultaneously outstanding misses supported by the processor
● ρ = R / ζ
– R: leading references in the inner-loop recurrence
– ζ: number of iterations to traverse the cycle
● What about f?

Dynamic Inner-loop Unrolling
● C_m = W / (i × L_m)
– C_m: number of copies of reference m that contribute overlapped misses
– W: instruction window size
– i: number of instructions in a loop body
– L_m: number of iterations which share a cache line

Miss Patterns
● Miss patterns determine which leading reference instances miss together
● Regular leading references
– Arrays
– Different leading references miss together
● Irregular leading references
– All others!
– Miss pattern is not analyzable
– Characterized by an overall miss rate P_m

And Finally We Have f!
● f = f_reg + f_irreg
– f_reg = Σ (m ∈ RLR) C_m
– f_irreg = Σ (m ∈ ILR) C_m × P_m
● Unroll-and-jam is applied as much as possible while maintaining f <= l × ρ
● Step four: re-compute f!
– New leading references and increased iteration size
– Leading references may become non-leading references

Resolving Window Constraints
● Inter-iteration constraints
– Occur when the independent read misses in W/i iterations do not fill the resources for memory parallelism
– Unroll the inner loop until the resources are filled
– Exposes independent misses to the instruction scheduler
● Intra-iteration window constraints
– The instruction scheduler packs independent read miss references in the loop body close to one another

Experimental Methodology
● Environment: Rice Simulator for ILP Multiprocessors (RSIM)
● Benchmarks: Latbench and five scientific applications
● Evaluation:
– Uniprocessor: 11–48% improvement
– Multiprocessor: 9–39% improvement

Conclusions
● Strengths
– Good performance
– Interaction with software prefetching provides greater benefits
● Weaknesses
– Legality of the transformations is not …