UT CS 378 - The Price of Cache-obliviousness



Contents: The Price of Cache-obliviousness; Memory Hierarchy Management; Questions; One study; Organization of talk; Cache-Oblivious Algorithms; CO: recursive micro-kernel; Data Structures; Cache-conscious algorithms; CC algorithms: discussion; Slide 11; Blocking; Blocking for MMM; Example: MMM on Itanium 2; Lessons; Slide 16; UltraSPARC IIIi; Naïve algorithms; Miss ratios; Recursive micro-kernel; Slide 21; Iterative micro-kernel; Recursion + Iterative micro-kernel; Loop + iterative micro-kernel; Slide 25; Recursion + mini-kernel; Recursion + mini-kernel + pre-fetching; Vendor BLAS; Slide 29; Summary; Ongoing Work; Slide 32

The Price of Cache-obliviousness
Keshav Pingali, University of Texas, Austin
Kamen Yotov, Goldman Sachs
Tom Roeder, Cornell University
John Gunnels, IBM T.J. Watson Research Center
Fred Gustavson, IBM T.J. Watson Research Center

Memory Hierarchy Management
• Cache-conscious (CC) approach:
  – Blocked iterative algorithms and arrays (usually)
    • Code and data structures have parameters that depend on careful blocking for the memory hierarchy
  – Used in dense linear algebra libraries: BLAS, LAPACK
    • Lots of algorithmic data reuse: O(N^3) operations on O(N^2) data
• Cache-oblivious (CO) approach:
  – Recursive algorithms and data structures (usually)
    • Not aware of the memory hierarchy: approximate blocking
    • I/O optimal: Hong and Kung; Frigo and Leiserson
  – Used in FFT implementations: FFTW
    • Little algorithmic data reuse: O(N log N) computations on O(N) data

Questions
• Does the CO approach perform as well as the CC approach?
  – Intuitively, a program that is oblivious of some hardware feature should be at a disadvantage relative to a "self-adaptive" one that is tuned to it.
  – Little experimental data in the literature
    • The CO community believes their approach outperforms the CC approach
    • But most studies from the CO community compare performance against unblocked (unoptimized) CC codes
• If not, what can be done to improve the performance of CO programs?

One study (Piyush Kumar, LNCS 2625, 2004)
• Studied recursive and iterative MMM on Itanium 2
• Recursive performs better
• But look at the MFlops: 30 MFlops
• Intel MKL: 6 GFlops

Organization of talk
• CO and CC approaches to blocking
  – control structures
  – data structures
• Non-standard view of blocking (or why CO may work well)
  – reduce bandwidth required from memory
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

Cache-Oblivious Algorithms
[Figure: A, B, C partitioned into 2x2 blocks A00..A11, B00..B11, C00..C11 for AD; A and C split into two panels A0, A1 and C0, C1 for LD]
• Divide all dimensions (AD)
    C00 = A00*B00 + A01*B10
    C01 = A01*B11 + A00*B01
    C11 = A10*B01 + A11*B11
    C10 = A10*B00 + A11*B10
• 8-way recursive tree down to 1x1 blocks
• Bilardi et al.
  – Gray-code order promotes reuse
• We use AD in the rest of the talk
• Divide the largest dimension (LD)
    C0 = A0*B
    C1 = A1*B
• Two-way recursive tree down to 1x1 blocks
• Frigo, Leiserson, et al.
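To make the LD control structure concrete, here is a minimal C sketch of divide-the-largest-dimension matrix multiplication that recurses down to 1x1 blocks, as described above; it computes C += A*B. The function name mmm_ld, the row-major layout, and the leading-dimension parameters lda, ldb, ldc are assumptions made for this sketch, not code from the talk.

/* Cache-oblivious MMM sketch (LD scheme): C += A * B, where A is m x k,
 * B is k x n, C is m x n, all row-major with leading dimensions
 * lda, ldb, ldc, and m, n, k >= 1. At each step the largest of the
 * three dimensions is halved, down to a 1x1x1 leaf. */
void mmm_ld(int m, int n, int k,
            const double *A, int lda,
            const double *B, int ldb,
            double *C, int ldc)
{
    if (m == 1 && n == 1 && k == 1) {
        C[0] += A[0] * B[0];               /* leaf: one multiply-add */
        return;
    }
    if (m >= n && m >= k) {                /* split the rows of A and C */
        int m1 = m / 2;
        mmm_ld(m1,     n, k, A,            lda, B, ldb, C,            ldc);
        mmm_ld(m - m1, n, k, A + m1 * lda, lda, B, ldb, C + m1 * ldc, ldc);
    } else if (n >= k) {                   /* split the columns of B and C */
        int n1 = n / 2;
        mmm_ld(m, n1,     k, A, lda, B,      ldb, C,      ldc);
        mmm_ld(m, n - n1, k, A, lda, B + n1, ldb, C + n1, ldc);
    } else {                               /* split the shared dimension k */
        int k1 = k / 2;
        mmm_ld(m, n, k1,     A,      lda, B,            ldb, C, ldc);
        mmm_ld(m, n, k - k1, A + k1, lda, B + k1 * ldb, ldb, C, ldc);
    }
}

The AD scheme used in the rest of the talk splits all three dimensions at once into an 8-way recursion instead. Recursing all the way to 1x1 leaves is what makes the internal-node overhead discussed on the next slide significant; the micro-kernel described there would replace this 1x1 leaf with a completely unrolled RU x RU x RU computation.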
CO: recursive micro-kernel
• Internal nodes of the recursion tree are pure recursive overhead; roughly
  – 100 cycles on Itanium 2
  – 360 cycles on UltraSPARC IIIi
• Large overhead: for LD, roughly one internal node per leaf node
• Solution:
  – Micro-kernel: the code obtained by completely unrolling the recursion tree for some fixed problem size (RU x RU x RU)
  – Cut off the recursion when the sub-problem size reaches the micro-kernel size, and invoke the micro-kernel
  – The overhead of an internal node is then amortized over a whole micro-kernel rather than a single multiply-add
  – Choice of RU: empirical
[Figure: recursive micro-kernel]

Data Structures
[Figure: Row-major, Row-Block-Row, and Morton-Z layouts]
• Match data structure layout to access patterns
• Improve
  – Spatial locality
  – Streaming
• Morton-Z is more complicated to implement
  – The payoff is small or even negative in our experience
• Rest of talk: use the RBR (Row-Block-Row) format with the block size matched to the micro-kernel

Cache-conscious algorithms
[Figure: cache blocking and register blocking of the loop nest]

CC algorithms: discussion
• Iterative codes
  – Nested loops
• Implementation of blocking
  – Cache blocking
    • Mini-kernel: in ATLAS, multiply NB x NB blocks
    • Choose NB so that NB^2 + NB + 1 <= C_L1 (the L1 cache capacity)
  – Register blocking
    • Micro-kernel: in ATLAS, multiply an MU x 1 block of A with a 1 x NU block of B into an MU x NU block of C
    • Choose MU, NU so that MU + NU + MU*NU <= NR (the number of registers)
  (A loop-nest sketch of this structure is given after the Lessons slide below.)

Organization of talk
• CO and CC approaches to blocking
  – control structures
  – data structures
• Non-standard view of blocking
  – reduce bandwidth required from memory
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

Blocking
• Microscopic view
  – Blocking reduces the expected latency of a memory access
• Macroscopic view
  – The memory hierarchy can be ignored if
    • memory has enough bandwidth to feed the processor, and
    • data can be pre-fetched to hide memory latency
  – Blocking reduces the bandwidth needed from memory
• Useful to consider the macroscopic view in more detail
[Figure: CPU <-> Cache <-> Memory]

Blocking for MMM
• Assume the processor can perform 1 FMA every cycle
• Ideal execution time for N x N MMM = N^3 cycles
• Square blocks: NB x NB
• Upper bound for NB:
  – the working set of a block computation must fit in cache
  – the size of the working set depends on the schedule: at most 3*NB^2
  – Upper bound on NB: 3*NB^2 <= cache capacity
• Lower bound for NB:
  – data movement in a block computation = 4*NB^2 doubles
  – total data movement = (N/NB)^3 * 4*NB^2 = 4*N^3/NB doubles
  – required bandwidth from memory = (4*N^3/NB) / N^3 = 4/NB doubles/cycle
  – Lower bound on NB: 4/NB < bandwidth between cache and memory
• Multi-level memory hierarchy: same idea
  – sqrt(capacity(L)/3) > NB_L > 4/Bandwidth(L, L+1) for adjacent levels L and L+1

Example: MMM on Itanium 2
[Figure: Itanium 2 memory hierarchy, FPU <-> registers <-> L2 <-> L3 <-> Memory, with 2 FMAs/cycle (L1 appears separately; floating-point loads bypass it on Itanium 2). Bandwidths are in doubles per cycle; there is a limit of 4 accesses per cycle between registers and L2. Derived ranges: 2 <= NB_R <= 6 with 1.33 <= B(R, L2) <= 4; 2 <= NB_L2 <= 6 with 1.33 <= B(L2, L3) <= 4; 16 <= NB_L3 <= 418 with 0.02 <= B(L3, Memory) <= 0.5]
• Between L3 and memory:
  – Constraints
    • 8/NB_L3 <= 0.5
    • 3*NB_L3^2 <= 524288 (4 MB)
  – Therefore memory has enough bandwidth for 16 <= NB_L3 <= 418
    • NB_L3 = 16 requires 8/NB_L3 = 0.5 doubles per cycle from memory
    • NB_L3 = 418 requires 8/NB_L3 ≈ 0.02 doubles per cycle from memory
    • NB_L3 > 418 is possible with better scheduling
  – (The factor of 8 here, rather than the 4 derived above, reflects the 2 FMAs per cycle of the Itanium 2 FPU: the ideal execution time drops to N^3/2 cycles, which doubles the required bandwidth.)

Lessons
• Reducing bandwidth requirements
  – The block size does not have to be exact
  – It is enough for the block size to lie within an interval that depends on hardware parameters
  – If the upper bound on NB is more than twice the lower bound, divide-and-conquer will automatically generate a block size in this range (repeated halving must produce a size in (U/2, U], which lies inside [L, U] whenever U >= 2L), so approximate blocking, CO-style, is OK
• Reducing latency
  – Accurate block sizes are better
  – If the block size is chosen only approximately, we may need to compensate with pre-fetching
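As promised above, here is a simplified C sketch of the iterative, cache-conscious structure from the "CC algorithms: discussion" slide: a mini-kernel that multiplies NB x NB cache blocks, with an MU x NU register-blocked micro-kernel inside it. The constants N, NB, MU, NU, the function names, and the assumption that N is divisible by NB (and NB by MU and NU) are illustrative choices for this sketch; it shows the loop structure only and is not ATLAS's generated code, whose NB, MU, NU would be tuned to satisfy the capacity constraints quoted above.

/* Cache-conscious MMM sketch: C += A*B for N x N row-major matrices.
 * NB is the cache-block size, MU x NU the register block.
 * Illustrative constants; a tuned library would pick them so that
 * 3*NB^2 doubles fit in cache and MU + NU + MU*NU registers suffice. */
enum { N = 512, NB = 64, MU = 4, NU = 4 };

/* Micro-kernel: multiply an MU x NB panel of A by an NB x NU panel of B
 * into an MU x NU block of C; the accumulator c[][] is meant to stay
 * in registers. */
static void micro_kernel(const double *A, const double *B, double *C)
{
    double c[MU][NU];
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            c[i][j] = C[i * N + j];
    for (int k = 0; k < NB; k++)       /* a sequence of MU x 1 by 1 x NU updates */
        for (int i = 0; i < MU; i++)
            for (int j = 0; j < NU; j++)
                c[i][j] += A[i * N + k] * B[k * N + j];
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            C[i * N + j] = c[i][j];
}

/* Mini-kernel: multiply one NB x NB block of A by one NB x NB block of B,
 * i.e. register blocking inside cache blocking. */
static void mini_kernel(const double *A, const double *B, double *C)
{
    for (int i = 0; i < NB; i += MU)
        for (int j = 0; j < NB; j += NU)
            micro_kernel(A + i * N, B + j, C + i * N + j);
}

/* Outer loops: cache blocking over NB x NB blocks of C, A, and B. */
void mmm_blocked(const double *A, const double *B, double *C)
{
    for (int ib = 0; ib < N; ib += NB)
        for (int jb = 0; jb < N; jb += NB)
            for (int kb = 0; kb < N; kb += NB)
                mini_kernel(A + ib * N + kb, B + kb * N + jb, C + ib * N + jb);
}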

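The block-size interval that the Lessons slide relies on can also be written out explicitly. The symbols below (f for FMAs completed per cycle, C for the capacity of a cache level in doubles, B for the bandwidth to the next level in doubles per cycle) are introduced only for this restatement and are not notation from the talk:

\[
\frac{4f}{B} \;\le\; NB \;\le\; \sqrt{\frac{C}{3}}
\]

For the L3-to-memory level of the Itanium 2 example (f = 2, B ≈ 0.5, C = 524288):

\[
\frac{4 \cdot 2}{0.5} = 16 \;\le\; NB_{L3} \;\le\; \sqrt{\frac{524288}{3}} \approx 418
\]

Since 418 is far more than twice 16, this is a case where, by the argument on the Lessons slide, the approximate block sizes produced by divide-and-conquer are good enough.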
