UT CS 378 - The Price of Cache-obliviousness



Contents: The Price of Cache-obliviousness; Memory Hierarchy Management; Questions; One study; Organization of talk; Cache-Oblivious Algorithms; CO: recursive micro-kernel; Data Structures; Cache-conscious algorithms; CC algorithms: discussion; Slide 11; Blocking; Blocking for MMM; Example: MMM on Itanium 2; Lessons; Slide 16; UltraSPARC IIIi; Naïve algorithms; Miss ratios; Recursive micro-kernel; Slide 21; Iterative micro-kernel; Recursion + Iterative micro-kernel; Loop + iterative micro-kernel; Slide 25; Recursion + mini-kernel; Recursion + mini-kernel + pre-fetching; Vendor BLAS; Slide 29; Summary; Ongoing Work; Slide 32

The Price of Cache-obliviousness
Keshav Pingali, University of Texas, Austin
Kamen Yotov, Goldman Sachs
Tom Roeder, Cornell University
John Gunnels, IBM T.J. Watson Research Center
Fred Gustavson, IBM T.J. Watson Research Center

Memory Hierarchy Management
• Cache-conscious (CC) approach:
  – Blocked iterative algorithms and arrays (usually)
    • Code and data structures have parameters that depend on careful blocking for the memory hierarchy
  – Used in dense linear algebra libraries: BLAS, LAPACK
    • Lots of algorithmic data reuse: O(N^3) operations on O(N^2) data
• Cache-oblivious (CO) approach:
  – Recursive algorithms and data structures (usually)
    • Not aware of the memory hierarchy: approximate blocking
    • I/O optimal: Hong and Kung; Frigo and Leiserson
  – Used in FFT implementations: FFTW
    • Little algorithmic data reuse: O(N log N) computations on O(N) data

Questions
• Does the CO approach perform as well as the CC approach?
  – Intuitively, a program that is oblivious of some hardware feature should be at a disadvantage relative to a "self-adaptive" one that is tuned to it.
  – Little experimental data in the literature
    • The CO community believes their approach outperforms the CC approach
    • But most studies from the CO community compare performance against unblocked (unoptimized) CC codes
• If not, what can be done to improve the performance of CO programs?

One study (Piyush Kumar, LNCS 2625, 2004)
• Studied recursive and iterative MMM on Itanium 2
• Recursive performs better
• But look at the MFlops: 30 MFlops
• Intel MKL: 6 GFlops

Organization of talk
• CO and CC approaches to blocking
  – control structures
  – data structures
• Non-standard view of blocking (or why CO may work well)
  – reduce bandwidth required from memory
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

Cache-Oblivious Algorithms
[Figure: A, B, C partitioned into 2x2 blocks A00..A11, B00..B11, C00..C11 for AD; A and C split into two panels A0, A1 and C0, C1 for LD]
• Divide all dimensions (AD)
    C00 = A00*B00 + A01*B10
    C01 = A01*B11 + A00*B01
    C11 = A10*B01 + A11*B11
    C10 = A10*B00 + A11*B10
• 8-way recursive tree down to 1x1 blocks
• Bilardi et al.
  – Gray-code order promotes reuse
• We use AD in the rest of the talk
• Divide the largest dimension (LD)
    C0 = A0*B
    C1 = A1*B
• Two-way recursive tree down to 1x1 blocks
• Frigo, Leiserson, et al.
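To make the LD control structure concrete, here is a minimal C sketch of divide-the-largest-dimension matrix multiplication that recurses down to 1x1 blocks, as described above; it computes C += A*B. The function name mmm_ld, the row-major layout, and the leading-dimension parameters lda, ldb, ldc are assumptions made for this sketch, not code from the talk.

/* Cache-oblivious MMM sketch (LD scheme): C += A * B, where A is m x k,
 * B is k x n, C is m x n, all row-major with leading dimensions
 * lda, ldb, ldc, and m, n, k >= 1. At each step the largest of the
 * three dimensions is halved, down to a 1x1x1 leaf. */
void mmm_ld(int m, int n, int k,
            const double *A, int lda,
            const double *B, int ldb,
            double *C, int ldc)
{
    if (m == 1 && n == 1 && k == 1) {
        C[0] += A[0] * B[0];               /* leaf: one multiply-add */
        return;
    }
    if (m >= n && m >= k) {                /* split the rows of A and C */
        int m1 = m / 2;
        mmm_ld(m1,     n, k, A,            lda, B, ldb, C,            ldc);
        mmm_ld(m - m1, n, k, A + m1 * lda, lda, B, ldb, C + m1 * ldc, ldc);
    } else if (n >= k) {                   /* split the columns of B and C */
        int n1 = n / 2;
        mmm_ld(m, n1,     k, A, lda, B,      ldb, C,      ldc);
        mmm_ld(m, n - n1, k, A, lda, B + n1, ldb, C + n1, ldc);
    } else {                               /* split the shared dimension k */
        int k1 = k / 2;
        mmm_ld(m, n, k1,     A,      lda, B,            ldb, C, ldc);
        mmm_ld(m, n, k - k1, A + k1, lda, B + k1 * ldb, ldb, C, ldc);
    }
}

The AD scheme used in the rest of the talk splits all three dimensions at once into an 8-way recursion instead. Recursing all the way to 1x1 leaves is what makes the internal-node overhead discussed on the next slide significant; the micro-kernel described there would replace this 1x1 leaf with a completely unrolled RU x RU x RU computation.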
CO: recursive micro-kernel
• Internal nodes of the recursion tree are pure recursive overhead; roughly
  – 100 cycles on Itanium 2
  – 360 cycles on UltraSPARC IIIi
• Large overhead: for LD, roughly one internal node per leaf node
• Solution:
  – Micro-kernel: the code obtained by completely unrolling the recursion tree for some fixed problem size (RU x RU x RU)
  – Cut off the recursion when the sub-problem size reaches the micro-kernel size, and invoke the micro-kernel
  – The overhead of an internal node is then amortized over a whole micro-kernel rather than a single multiply-add
  – Choice of RU: empirical
[Figure: recursive micro-kernel]

Data Structures
[Figure: Row-major, Row-Block-Row, and Morton-Z layouts]
• Match data structure layout to access patterns
• Improve
  – Spatial locality
  – Streaming
• Morton-Z is more complicated to implement
  – The payoff is small or even negative in our experience
• Rest of talk: use the RBR (Row-Block-Row) format with the block size matched to the micro-kernel

Cache-conscious algorithms
[Figure: cache blocking and register blocking of the loop nest]

CC algorithms: discussion
• Iterative codes
  – Nested loops
• Implementation of blocking
  – Cache blocking
    • Mini-kernel: in ATLAS, multiply NB x NB blocks
    • Choose NB so that NB^2 + NB + 1 <= C_L1 (the L1 cache capacity)
  – Register blocking
    • Micro-kernel: in ATLAS, multiply an MU x 1 block of A with a 1 x NU block of B into an MU x NU block of C
    • Choose MU, NU so that MU + NU + MU*NU <= NR (the number of registers)
  (A loop-nest sketch of this structure is given after the Lessons slide below.)

Organization of talk
• CO and CC approaches to blocking
  – control structures
  – data structures
• Non-standard view of blocking
  – reduce bandwidth required from memory
• Experimental results
  – UltraSPARC IIIi
  – Itanium
  – Xeon
  – Power 5
• Lessons and ongoing work

Blocking
• Microscopic view
  – Blocking reduces the expected latency of a memory access
• Macroscopic view
  – The memory hierarchy can be ignored if
    • memory has enough bandwidth to feed the processor, and
    • data can be pre-fetched to hide memory latency
  – Blocking reduces the bandwidth needed from memory
• Useful to consider the macroscopic view in more detail
[Figure: CPU <-> Cache <-> Memory]

Blocking for MMM
• Assume the processor can perform 1 FMA every cycle
• Ideal execution time for N x N MMM = N^3 cycles
• Square blocks: NB x NB
• Upper bound for NB:
  – the working set of a block computation must fit in cache
  – the size of the working set depends on the schedule: at most 3*NB^2
  – Upper bound on NB: 3*NB^2 <= cache capacity
• Lower bound for NB:
  – data movement in a block computation = 4*NB^2 doubles
  – total data movement = (N/NB)^3 * 4*NB^2 = 4*N^3/NB doubles
  – required bandwidth from memory = (4*N^3/NB) / N^3 = 4/NB doubles/cycle
  – Lower bound on NB: 4/NB < bandwidth between cache and memory
• Multi-level memory hierarchy: same idea
  – sqrt(capacity(L)/3) > NB_L > 4/Bandwidth(L, L+1) for adjacent levels L and L+1

Example: MMM on Itanium 2
[Figure: Itanium 2 memory hierarchy, FPU <-> registers <-> L2 <-> L3 <-> Memory, with 2 FMAs/cycle (L1 appears separately; floating-point loads bypass it on Itanium 2). Bandwidths are in doubles per cycle; there is a limit of 4 accesses per cycle between registers and L2. Derived ranges: 2 <= NB_R <= 6 with 1.33 <= B(R, L2) <= 4; 2 <= NB_L2 <= 6 with 1.33 <= B(L2, L3) <= 4; 16 <= NB_L3 <= 418 with 0.02 <= B(L3, Memory) <= 0.5]
• Between L3 and memory:
  – Constraints
    • 8/NB_L3 <= 0.5
    • 3*NB_L3^2 <= 524288 (4 MB)
  – Therefore memory has enough bandwidth for 16 <= NB_L3 <= 418
    • NB_L3 = 16 requires 8/NB_L3 = 0.5 doubles per cycle from memory
    • NB_L3 = 418 requires 8/NB_L3 ≈ 0.02 doubles per cycle from memory
    • NB_L3 > 418 is possible with better scheduling
  – (The factor of 8 here, rather than the 4 derived above, reflects the 2 FMAs per cycle of the Itanium 2 FPU: the ideal execution time drops to N^3/2 cycles, which doubles the required bandwidth.)

Lessons
• Reducing bandwidth requirements
  – The block size does not have to be exact
  – It is enough for the block size to lie within an interval that depends on hardware parameters
  – If the upper bound on NB is more than twice the lower bound, divide-and-conquer will automatically generate a block size in this range (repeated halving must produce a size in (U/2, U], which lies inside [L, U] whenever U >= 2L), so approximate blocking, CO-style, is OK
• Reducing latency
  – Accurate block sizes are better
  – If the block size is chosen only approximately, we may need to compensate with pre-fetching
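As promised above, here is a simplified C sketch of the iterative, cache-conscious structure from the "CC algorithms: discussion" slide: a mini-kernel that multiplies NB x NB cache blocks, with an MU x NU register-blocked micro-kernel inside it. The constants N, NB, MU, NU, the function names, and the assumption that N is divisible by NB (and NB by MU and NU) are illustrative choices for this sketch; it shows the loop structure only and is not ATLAS's generated code, whose NB, MU, NU would be tuned to satisfy the capacity constraints quoted above.

/* Cache-conscious MMM sketch: C += A*B for N x N row-major matrices.
 * NB is the cache-block size, MU x NU the register block.
 * Illustrative constants; a tuned library would pick them so that
 * 3*NB^2 doubles fit in cache and MU + NU + MU*NU registers suffice. */
enum { N = 512, NB = 64, MU = 4, NU = 4 };

/* Micro-kernel: multiply an MU x NB panel of A by an NB x NU panel of B
 * into an MU x NU block of C; the accumulator c[][] is meant to stay
 * in registers. */
static void micro_kernel(const double *A, const double *B, double *C)
{
    double c[MU][NU];
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            c[i][j] = C[i * N + j];
    for (int k = 0; k < NB; k++)       /* a sequence of MU x 1 by 1 x NU updates */
        for (int i = 0; i < MU; i++)
            for (int j = 0; j < NU; j++)
                c[i][j] += A[i * N + k] * B[k * N + j];
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            C[i * N + j] = c[i][j];
}

/* Mini-kernel: multiply one NB x NB block of A by one NB x NB block of B,
 * i.e. register blocking inside cache blocking. */
static void mini_kernel(const double *A, const double *B, double *C)
{
    for (int i = 0; i < NB; i += MU)
        for (int j = 0; j < NB; j += NU)
            micro_kernel(A + i * N, B + j, C + i * N + j);
}

/* Outer loops: cache blocking over NB x NB blocks of C, A, and B. */
void mmm_blocked(const double *A, const double *B, double *C)
{
    for (int ib = 0; ib < N; ib += NB)
        for (int jb = 0; jb < N; jb += NB)
            for (int kb = 0; kb < N; kb += NB)
                mini_kernel(A + ib * N + kb, B + kb * N + jb, C + ib * N + jb);
}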

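The block-size interval that the Lessons slide relies on can also be written out explicitly. The symbols below (f for FMAs completed per cycle, C for the capacity of a cache level in doubles, B for the bandwidth to the next level in doubles per cycle) are introduced only for this restatement and are not notation from the talk:

\[
\frac{4f}{B} \;\le\; NB \;\le\; \sqrt{\frac{C}{3}}
\]

For the L3-to-memory level of the Itanium 2 example (f = 2, B ≈ 0.5, C = 524288):

\[
\frac{4 \cdot 2}{0.5} = 16 \;\le\; NB_{L3} \;\le\; \sqrt{\frac{524288}{3}} \approx 418
\]

Since 418 is far more than twice 16, this is a case where, by the argument on the Lessons slide, the approximate block sizes produced by divide-and-conquer are good enough.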
