UCSD CSE 231 - Code Transformations

Code Transformations to Improve Memory Parallelism
Vijay S. Pai and Sarita Adve
Presenter: Divya Kumar

Objectives
●To introduce code transformation techniques that reduce memory stall time
●Why read miss clustering?
●Why unroll-and-jam?
●What are the limiting factors of performance?
●What is the impact on application execution time?

Paper Outline
●Introduction
●Motivation for Read Miss Clustering
●Analysis and Transformation Framework
●Evaluation Methodology
●Experimental Results
●Conclusions and Future Work

Read Miss Clustering
●Why do we need to overlap multiple read misses?
–Read stall time is the bottleneck in ILP-based processors
–Read miss latencies are too long to overlap with other instruction types
●Why read miss clustering?
–Overlaps independent read misses within a single instruction window
●How do we implement it?
–Unroll-and-jam

Row-wise Traversal

    for (...; j < 4; j++)
        for (...; i < 4; i++)
            ... A[j,i] ...

[Figure: 4x4 array; traversal along a row gives spatial locality, traversal down a column gives miss clustering]

Column-wise Traversal (Step One: Interchange)

    for (...; i < 4; i++)
        for (...; j < 4; j++)
            ... A[j,i] ...

●Maximizes clustering
●Minimizes cache locality

Step Two: Strip-mine and Interchange

    for (...; jj += N)
        for (...; i++)
            for (j = jj; j < jj + N; j++) {
                ... A[j,i] ...
            }

Step Three: Unroll-and-Jam

    for (...; j += N)
        for (...; i++) {
            ... A[j,i] ...
            ... A[j+1,i] ...
            ...
            ... A[j+N-1,i] ...
        }

●Stop column-wise traversal when miss clustering resources are fully utilized
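
For concreteness, the three steps can be assembled into a minimal runnable C sketch. The 4x4 bounds, the row-major array a, the unroll-and-jam degree N = 2, and the sum reduction used as the loop body are all assumptions made for illustration; the slides elide these details.

    /* Minimal sketch of the clustering transformation sequence.
     * All concrete choices (bounds, N = 2, the reduction body) are
     * illustrative assumptions, not taken from the slides. */
    #include <stdio.h>

    #define ROWS 4
    #define COLS 4
    #define N    2   /* assumed degree: enough misses to fill miss resources */

    static double a[ROWS][COLS];

    /* Original row-wise traversal: consecutive i touch one cache line,
     * so the loop has spatial locality but few overlapped misses. */
    static double row_wise(void) {
        double sum = 0.0;
        for (int j = 0; j < ROWS; j++)
            for (int i = 0; i < COLS; i++)
                sum += a[j][i];
        return sum;
    }

    /* Step one (interchange): consecutive iterations touch different
     * cache lines, so misses cluster, but spatial locality is lost. */
    static double column_wise(void) {
        double sum = 0.0;
        for (int i = 0; i < COLS; i++)
            for (int j = 0; j < ROWS; j++)
                sum += a[j][i];
        return sum;
    }

    /* Steps two and three (strip-mine + interchange, then unroll-and-jam):
     * the j loop is strip-mined by N and the strip is unrolled into the
     * inner i loop. Each inner iteration now issues N independent read
     * misses (rows j and j+1) that can overlap in one instruction window,
     * while the i loop still walks each row with spatial locality.
     * Assumes ROWS % N == 0; otherwise a postlude loop handles the rest. */
    static double clustered(void) {
        double sum = 0.0;
        for (int j = 0; j < ROWS; j += N)
            for (int i = 0; i < COLS; i++) {
                sum += a[j][i];     /* independent miss 1 */
                sum += a[j + 1][i]; /* independent miss 2, same window */
            }
        return sum;
    }

    int main(void) {
        for (int j = 0; j < ROWS; j++)
            for (int i = 0; i < COLS; i++)
                a[j][i] = j * COLS + i;
        /* All three variants compute the same sum. */
        printf("%.0f %.0f %.0f\n", row_wise(), column_wise(), clustered());
        return 0;
    }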

Benefits
●Scalar replacement
●Inner-loop iteration count does not change
●Clustering in the postlude is possible

Analysis
●Locality analysis to determine leading references and inner-loop self-spatial locality (M. E. Wolf and M. S. Lam, "A Data Locality Optimizing Algorithm," Proc. of the Conf. on Programming Language Design and Implementation, 1991)
●Limitations to read miss parallelism:
–Cache-line dependences
–Address dependences
–Window constraints

Cache-line Dependence
●Memory operation B is cache-line dependent on A if
–A is a leading reference, and
–a miss on A brings in the data of B
[Figure: leading references with inner-loop self-spatial locality; the inner-loop dependence distance]

Address Dependences
●Memory operation B is address dependent on A if
–the result of A is used to compute the address of B

Window Constraints
●Limit on the number of read misses that can be held simultaneously in the instruction window
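
The first two limits are easy to see in code. Below is a hypothetical pair of C loops, assuming 8-byte doubles and 64-byte cache lines (so 8 elements share a line); the function names and loop bodies are invented for illustration.

    /* Cache-line dependence: with 8 doubles per line, a[8*k] is the
     * leading reference; a[8*k+1] .. a[8*k+7] are cache-line dependent
     * on it (its miss brings in their data), so at most one access per
     * line can contribute an independent miss. */
    double sum_sequential(const double *a, int n) {
        double s = 0.0;
        for (int k = 0; k < n; k++)
            s += a[k];
        return s;
    }

    /* Address dependence: the load of idx[k] must complete before the
     * address of a[idx[k]] is even known, so the two misses serialize
     * and cannot be overlapped. */
    double sum_gather(const double *a, const int *idx, int n) {
        double s = 0.0;
        for (int k = 0; k < n; k++)
            s += a[idx[k]];
        return s;
    }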

Background on Floating-Point Pipelining
●Unroll-and-jam has previously been used to improve floating-point pipelining
●To fill the pipeline, unroll-and-jam must be applied until f >= l x ρ, where
–f: floating-point operations in the innermost loop
–l: number of stages in the floating-point pipeline
–ρ = R / ς
–R: no. of nodes in the inner-loop recurrence
–ς: no. of iterations to traverse the cycle

Mapping to Memory Parallelism
●l: maximum number of simultaneously outstanding misses supported by the processor
●ρ = R / ς
–R: leading references in the inner-loop recurrence
–ς: number of iterations to traverse the cycle
●What about f?

Dynamic Inner-loop Unrolling
●C_m = W / (i x L_m)
–C_m: no. of copies of reference m that contribute overlapped misses
–W: window size
–i: no. of instructions in a loop body
–L_m: no. of iterations which share a cache line

Miss Patterns
●We need miss patterns to determine which leading reference instances miss together
●Regular leading references
–Arrays
–Different leading references miss together
●Irregulars
–All others!
–Miss pattern is not analyzable
–Overall miss rate P_m

And finally we have f!!!
●f = f_reg + f_irreg
–f_reg = Σ_{m ∈ RLR} C_m (over the regular leading references)
–f_irreg = Σ_{m ∈ ILR} C_m x P_m (over the irregular leading references)
●Unroll-and-jam is applied as much as possible while maintaining f <= l x ρ
●Step four: re-compute f!
–New leading references and increased iteration size
–Leading references may become non-leading references
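
A worked numeric instance of these formulas may help. Every number below (window size, per-iteration instruction count, miss resources, miss rate, and the choice of one regular plus one irregular leading reference) is an assumption for illustration, not taken from the paper.

    /* Worked example of the unroll-and-jam sizing formulas.
     * All parameter values are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double W   = 64;   /* instruction window size                  */
        double i   = 16;   /* instructions per inner-loop body         */
        double l   = 8;    /* max simultaneous outstanding misses      */
        double rho = 1.0;  /* R / sigma: assume a one-reference
                              recurrence traversed in one iteration    */

        /* Regular leading reference: 4 iterations share a cache line. */
        double C_reg = W / (i * 4);   /* = 1.0 copy in the window      */
        double f_reg = C_reg;         /* contributes a miss every time */

        /* Irregular leading reference: pattern not analyzable, one
           access per iteration (L_m = 1), overall miss rate P_m = 0.25. */
        double C_irr = W / (i * 1);   /* = 4.0 copies in the window    */
        double f_irr = C_irr * 0.25;  /* = 1.0 expected misses         */

        double f = f_reg + f_irr;     /* = 2.0                         */
        printf("f = %.1f, l x rho = %.1f\n", f, l * rho);
        /* f = 2.0 is well below l x rho = 8.0, so unroll-and-jam can
           still be applied (raising f), after which f is re-computed
           as step four prescribes. */
        return 0;
    }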

Resolving Window Constraints
●Inter-iteration constraints
–Occur when the independent read misses in W/i iterations do not fill the resources for memory parallelism
–Resolved by unrolling the inner loop until the resources are filled
–Exposes independent misses to the instruction scheduler
●Intra-iteration window constraints
–The instruction scheduler packs independent read miss references in the loop body close to one another
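
A sketch of the inter-iteration case follows; the stride-8 access pattern, the unroll degree 4, and the function names are illustrative assumptions only.

    #define STRIDE 8   /* assumed: one read miss per iteration */

    /* Before: the window spans W/i iterations; if that covers too few
     * iterations, too few independent misses fit in the window. */
    double before(const double *b, int n) {
        double s = 0.0;
        for (int k = 0; k < n; k++)
            s += b[k * STRIDE];
        return s;
    }

    /* After: unroll by 4 and let the scheduler pack the four
     * independent missing loads next to each other, so they issue
     * back-to-back within one window (the intra-iteration concern).
     * Assumes n % 4 == 0; a cleanup loop would handle the remainder. */
    double after(const double *b, int n) {
        double s = 0.0;
        for (int k = 0; k < n; k += 4) {
            double t0 = b[(k + 0) * STRIDE];  /* four independent      */
            double t1 = b[(k + 1) * STRIDE];  /* misses, clustered at  */
            double t2 = b[(k + 2) * STRIDE];  /* the top of the        */
            double t3 = b[(k + 3) * STRIDE];  /* unrolled body         */
            s += t0 + t1 + t2 + t3;
        }
        return s;
    }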

Evaluation Methodology
●Environment: Rice Simulator for ILP Multiprocessors (RSIM)
●Benchmarks: Latbench and five scientific applications
●Execution-time improvement:
–Uniprocessor: 11-48%
–Multiprocessor: 9-39%

Conclusions
●Strengths
–Good performance
–Interaction with software prefetching provides greater benefits
●Weakness
–Legality of the transformations is not ...