UT CS 395T - A Comparison of Cache-conscious and Cache-oblivious Programs



A Comparison of Cache-conscious and Cache-oblivious Programs
Keshav Pingali, University of Texas at Austin
Joint work with Kamen Yotov (Goldman Sachs), Tom Roeder (Cornell University), John Gunnels (IBM T.J. Watson Research Center), and Fred Gustavson (IBM T.J. Watson Research Center)

Memory Hierarchy Management
• Cache-conscious (CC) approach:
  – Blocked iterative algorithms and arrays (usually)
    • Code and data structures have parameters that depend on careful blocking for the memory hierarchy
  – Used in dense linear algebra libraries: BLAS, LAPACK
    • Lots of algorithmic data reuse: O(N^3) operations on O(N^2) data
• Cache-oblivious (CO) approach:
  – Recursive algorithms and data structures (usually)
    • Not aware of the memory hierarchy: approximate blocking
    • I/O optimal: Hong and Kung; Frigo and Leiserson
  – Used in FFT implementations: FFTW
    • Little algorithmic data reuse: O(N log N) computations on O(N) data

Questions
• Does the CO approach perform as well as the CC approach?
  – Intuitively, a "self-adaptive" program that is oblivious of some hardware feature should be at a disadvantage.
  – Little experimental data in the literature
    • The CO community believes its approach outperforms the CC approach
    • But most studies from the CO community compare performance with unblocked (unoptimized) CC codes
• If not, what can be done to improve the performance of CO programs?

One study: Piyush Kumar (LNCS 2625)
• Studied recursive and iterative MMM on Itanium 2
• Recursive performs better
• But look at the MFlops: 30 MFlops
• Intel MKL: 6 GFlops

Organization of talk
• CO and CC approaches to blocking
  – control structures
  – data structures
• Non-standard view of blocking (or why CO may work well)
  – reduce bandwidth required from memory
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

Cache-Oblivious Algorithms
C00 = A00*B00 + A01*B10
C01 = A00*B01 + A01*B11
C10 = A10*B00 + A11*B10
C11 = A10*B01 + A11*B11
• Divide all dimensions (AD)
• 8-way recursive tree down to 1x1 blocks
• Bilardi et al.
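Assuming square, power-of-two matrices in row-major storage, the all-dimensions (AD) scheme above can be sketched as follows; the leaf cutoff is a stand-in for the unrolled micro-kernel the talk introduces later, and all names and sizes here are illustrative, not the authors' code.

```c
#include <stddef.h>

/* Sketch of the AD recursion: split every dimension in half, giving the
 * eight sub-multiplies listed above. Matrices are row-major with leading
 * dimension ld; assumes n is a power of two (or a multiple of CUTOFF). */
enum { CUTOFF = 16 };  /* stand-in for the micro-kernel size RU */

static void mmm_rec(const double *A, const double *B, double *C,
                    size_t n, size_t ld)
{
    if (n <= CUTOFF) {                 /* leaf: plain triple loop */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    size_t h = n / 2;                  /* quadrant offsets in row-major */
    const double *A00 = A,        *A01 = A + h,
                 *A10 = A + h*ld, *A11 = A + h*ld + h;
    const double *B00 = B,        *B01 = B + h,
                 *B10 = B + h*ld, *B11 = B + h*ld + h;
    double       *C00 = C,        *C01 = C + h,
                 *C10 = C + h*ld, *C11 = C + h*ld + h;

    mmm_rec(A00, B00, C00, h, ld); mmm_rec(A01, B10, C00, h, ld); /* C00 */
    mmm_rec(A00, B01, C01, h, ld); mmm_rec(A01, B11, C01, h, ld); /* C01 */
    mmm_rec(A10, B00, C10, h, ld); mmm_rec(A11, B10, C10, h, ld); /* C10 */
    mmm_rec(A10, B01, C11, h, ld); mmm_rec(A11, B11, C11, h, ld); /* C11 */
}
```

Each recursive call touches quadrants of the original arrays in place, so no copying is needed; reordering the eight calls (e.g. in Gray-code order) changes reuse but not the result.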
  – Gray-code order promotes reuse
• We use AD in the rest of the talk

[Figure: quadrant decomposition of A, B, C for AD; halving of A and C for LD]
C0 = A0*B
C1 = A1*B
• Divide largest dimension (LD)
• Two-way recursive tree down to 1x1 blocks
• Frigo, Leiserson, et al.

CO: recursive micro-kernel
• Internal nodes of the recursion tree are recursive overhead; roughly
  – 100 cycles on Itanium 2
  – 360 cycles on UltraSPARC IIIi
• Large overhead: for LD, roughly one internal node per leaf node
• Solution:
  – Micro-kernel: code obtained by completely unrolling the recursion tree for some fixed-size problem (RUxRUxRU)
  – Cut off the recursion when the sub-problem size becomes equal to the micro-kernel size, and invoke the micro-kernel
  – The overhead of an internal node is amortized over the micro-kernel, rather than over a single multiply-add
  – Choice of RU: empirical

Data Structures
• Match data structure layout to access patterns
• Improve
  – Spatial locality
  – Streaming
• Layouts: Row-major, Row-Block-Row (RBR), Morton-Z
• Morton-Z is more complicated to implement
  – The payoff is small or even negative in our experience
• Rest of talk: use the RBR format with block size matched to the micro-kernel

Cache-conscious algorithms
[Figure: cache blocking and register blocking of A, B, and C (parameters NB, KB, MU)]

CC algorithms: discussion
• Iterative codes
  – Nested loops
• Implementation of blocking
  – Cache blocking
    • Mini-kernel: in ATLAS, multiply NBxNB blocks
    • Choose NB so that NB^2 + NB + 1 <= C_L1
  – Register blocking
    • Micro-kernel: in ATLAS, multiply an MUx1 block of A with a 1xNU block of B into an MUxNU block of C
    • Choose MU, NU so that MU + NU + MU*NU <= NR

Organization of talk
• CO and CC approaches to blocking
  – control structures
  – data structures
• Non-standard view of blocking
  – reduce bandwidth required from memory
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

Blocking
• Microscopic view
  – Blocking reduces the expected latency of memory accesses
• Macroscopic view
  – The memory hierarchy can be ignored if
    • memory has enough bandwidth to feed the processor
    • data can be pre-fetched to hide memory latency
  – Blocking reduces the bandwidth needed from memory
• Useful to consider the macroscopic view in more detail

Blocking for MMM
• Assume the processor can perform 1 FMA every cycle
• Ideal execution time for NxN MMM = N^3 cycles
• Square blocks: NB x NB
• Upper bound for NB:
  – the working set of a block computation must fit in cache
  – the size of the working set depends on the schedule: at most 3*NB^2
  – Upper bound on NB: 3*NB^2 <= Cache Capacity
• Lower bound for NB:
  – data movement in a block computation = 4*NB^2
  – total data movement < (N/NB)^3 * 4*NB^2 = 4*N^3/NB doubles
  – required bandwidth from memory = (4*N^3/NB) / N^3 = 4/NB doubles/cycle
  – Lower bound on NB: 4/NB < bandwidth between cache and memory
• Multi-level memory hierarchy (CPU - Cache - Memory): same idea
  – sqrt(capacity(L)/3) > NB_L > 4/Bandwidth(L,L+1) for adjacent levels L, L+1

Example: MMM on Itanium 2
• Hierarchy: FPU (2 FMAs/cycle) - Registers - L2 - L3 - Memory; bandwidths in doubles per cycle, with at most 4 accesses per cycle between registers and L2
• Between L3 and Memory:
  – Constraints
    • 8/NB_L3 <= 0.5
    • 3*NB_L3^2 <= 524288 (4MB)
  – Therefore Memory has enough bandwidth for 16 <= NB_L3 <= 418
    • NB_L3 = 16 requires 8/NB_L3 = 0.5 doubles per cycle from Memory
    • NB_L3 = 418 requires 8/NB_L3 ≈ 0.02 doubles per cycle from Memory
    • NB_L3 > 418 possible with better scheduling
• Summary of ranges:
  – 2 <= NB_R <= 6, with 1.33 <= B(R,L2) <= 4
  – 2 <= NB_L2 <= 6, with 1.33 <= B(L2,L3) <= 4
  – 16 <= NB_L3 <= 418, with 0.02 <= B(L3,Memory) <= 0.5

Lessons
• Reducing bandwidth requirements
  – The block size does not have to be exact
  – It is enough for the block size to lie within an interval that depends on hardware parameters
  – If the upper bound on NB is more than twice the lower bound, divide-and-conquer will automatically generate a block size in this range => approximate blocking CO-style is OK
• Reducing latency
  – Accurate block sizes are better
  – If the block size is chosen approximately, we may need to compensate with prefetching

Organization of talk
• Non-standard view of blocking
  – reduce bandwidth required from memory
• CO and CC approaches to blocking
  – control structures
  – data structures
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

UltraSPARC IIIi
• Peak performance: 2 GFlops (1 GHz, 2 FPUs)
• Memory hierarchy:
  – Registers: 32
  – L1 data cache: 64KB, 4-way
  – L2 data cache: 1MB, 4-way
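As a concrete illustration of the CC structure discussed earlier (an ATLAS-style mini-kernel over NBxNB cache blocks wrapped around an MUxNU register-blocked micro-kernel), the sketch below uses untuned placeholder block sizes and assumes n is a multiple of NB, with NB a multiple of MU and NU; it is not ATLAS code.

```c
#include <stddef.h>

/* Illustrative cache-conscious MMM. A real library would tune the blocking
 * parameters so that NB^2 + NB + 1 <= C_L1 (cache blocking) and
 * MU + NU + MU*NU <= NR (register blocking); the values here are
 * placeholders chosen only to keep the sketch simple. */
enum { NB = 32, MU = 4, NU = 4 };

static void mmm_blocked(const double *A, const double *B, double *C, size_t n)
{
    for (size_t ib = 0; ib < n; ib += NB)
      for (size_t jb = 0; jb < n; jb += NB)
        for (size_t kb = 0; kb < n; kb += NB)           /* mini-kernel: NBxNB */
          for (size_t i = ib; i < ib + NB; i += MU)
            for (size_t j = jb; j < jb + NB; j += NU) { /* micro-kernel: MUxNU */
                double c[MU][NU];                       /* tile held in registers */
                for (size_t u = 0; u < MU; u++)
                    for (size_t v = 0; v < NU; v++)
                        c[u][v] = C[(i+u)*n + (j+v)];
                for (size_t k = kb; k < kb + NB; k++)   /* MUx1 of A times 1xNU of B */
                    for (size_t u = 0; u < MU; u++)
                        for (size_t v = 0; v < NU; v++)
                            c[u][v] += A[(i+u)*n + k] * B[k*n + (j+v)];
                for (size_t u = 0; u < MU; u++)
                    for (size_t v = 0; v < NU; v++)
                        C[(i+u)*n + (j+v)] = c[u][v];
            }
}
```

Unlike the recursive CO version, every block size here is an explicit parameter of the code, which is exactly why CC libraries need per-machine tuning.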

