CORNELL CS 612 - Lecture Notes
Homework #1
• http://www.cs.cornell.edu/Courses/cs612/2002SP/assignments/ps1/ps1.htm
• Due Tuesday, February 14th.

Cache Models and Program Transformations

Goal of this lecture
• We have looked at computational science applications, and isolated key kernels (MVM, MMM, linear system solvers, ...).
• We have studied caches and virtual memory, and we understand what causes cache misses (cold, capacity, conflict).
• Let us look at how to make some of the kernels run well on machines with caches.

Matrix-vector Product
[Figure: matrix A with row index i and column index j, vectors x and y]
Code:
  for i = 1,N
    for j = 1,N
      y(i) = y(i) + A(i,j)*x(j)
Total number of references = 4N²
We want to study two questions.
• Can we predict the miss ratio of different variations of this program for different cache models?
• What transformations can we do to improve performance? That is, how do we improve the miss ratio?

Reuse Distance
If r1 and r2 are two references to the same cache line in some memory stream, reuseDistance(r1,r2) is the number of distinct cache lines referenced between r1 and r2.
Cache model:
• fully associative cache (so no conflict misses)
• LRU replacement strategy
• We will look at two extremes:
  - large cache model: no capacity misses
  - small cache model: a reference misses whenever its reuse distance is of the order of the problem size (the size of the arrays)

Scenario I
[Figure: matrix A with indices i, j; vectors x and y]
Cache model:
• fully associative cache (no conflict misses)
• LRU replacement strategy
• cache line size = 1 floating-point number
Small cache: assume the cache can hold fewer than (2N+1) numbers.
Misses:
• matrix A: N² cold misses
• vector x: N cold misses + N(N−1) capacity misses
• vector y: N cold misses
• Miss ratio = (2N² + N)/4N² → 0.5

Large cache model: the cache can hold more than (2N+1) numbers.
Misses:
• matrix A: N² cold misses
• vector x: N cold misses
• vector y: N cold misses
• Miss ratio = (N² + 2N)/4N² → 0.25
[Figure: miss ratio vs. N — 0.25 in the large-cache regime, 0.50 in the small-cache regime, transition near N = c/2, where c = size of cache in # of fp's]

Scenario II
[Figure: matrix A with indices i, j; vectors x and y]
Same cache model as Scenario I, but different code.
Code: walk matrix A by columns
  for j = 1,N
    for i = 1,N
      // SAXPY
      y(i) = y(i) + A(i,j)*x(j)
It is easy to show that the miss ratios are identical to Scenario I.

Scenario III
[Figure: matrix A with indices i, j; vectors x and y]
Cache model:
• fully associative cache (no conflict misses)
• LRU replacement strategy
• cache line size = b floating-point numbers (can exploit spatial locality)
Code: (original) i-j loop order
  for i = 1,N
    for j = 1,N
      y(i) = y(i) + A(i,j)*x(j)
Let us assume A is stored in row-major order.

Small cache:
Misses:
• matrix A: N²/b cold misses
• vector x: N/b cold misses + N(N−1)/b capacity misses
• vector y: N/b cold misses
• Miss ratio = (1/2 + 1/4N)·(1/b) → 1/2b
Large cache:
Misses:
• matrix A: N²/b cold misses
• vector x: N/b cold misses
• vector y: N/b cold misses
• Miss ratio = (1/4 + 1/2N)·(1/b) → 1/4b
The transition from the small cache to the large cache model occurs when 2N + b = c.

[Figure: miss ratios for Scenario III — 1/2b in the small-cache regime, 1/4b in the large-cache regime, transition near N = c/2; b = number of fp's in one cache line]
Let us plug in some numbers for the SGI Octane:
• Line size = 32 bytes ⇒ b = 4
• Cache size = 32 KB ⇒ c = 4K
• Large cache miss ratio = 1/16 = 0.06
• Small cache miss ratio = 0.12
• Small/large transition size: N = 2000

Scenario IV
[Figure: matrix A with indices i, j; vectors x and y]
Cache model:
• fully associative cache (no conflict misses)
• LRU replacement strategy
• cache line size = b floating-point numbers (can exploit spatial locality)
Code: j-i loop order
  for j = 1,N
    for i = 1,N
      y(i) = y(i) + A(i,j)*x(j)
Note: we are not walking over A in memory layout order.

Small cache:
Misses:
• matrix A: N² misses (N²/b cold + N²(b−1)/b capacity: each line is evicted before its b−1 reuses)
• vector x: N/b cold misses
• vector y: N/b cold misses + N(N−1)/b capacity misses
• Miss ratio = 0.25·(1 + 1/b) + 1/4Nb → 0.25·(1 + 1/b)
Large cache:
Misses:
• matrix A: N²/b cold misses
• vector x: N/b cold misses
• vector y: N/b cold misses
• Miss ratio = (1/4 + 1/2N)·(1/b) → 1/4b
The transition from small cache to large cache occurs when c > bN + N + b.

[Figure: miss ratios for Scenario IV — 0.25(1 + 1/b) in the small-cache regime, 0.25/b in the large-cache regime, transition near N = c/(b+1); b = number of fp's in one cache line]
Let us plug in some numbers for the SGI Octane:
• Line size = 32 bytes ⇒ b = 4
• Cache size = 32 KB ⇒ c = 4K
• Large cache miss ratio = 1/16 = 0.06
• Small cache miss ratio = 0.31
• Small/large transition size: N = 800

Scenario V: Blocked Code
[Figure: matrix A tiled into B×B blocks, with vectors x and y]
Code:
  for bi = 1, N/B
    for bj = 1, N/B
      for i = (bi-1)*B + 1, bi*B
        for j = (bj-1)*B + 1, bj*B
          y(i) = y(i) + A(i,j)*x(j)

• Pick the block size B so that you effectively have the large cache model while executing the code within a block (2B = c).
• Misses within a block:
  - matrix A: B²/b cold misses
  - vector x: B/b
  - vector y: B/b
• Total number of block computations = (N/B)²
• Miss ratio = (0.25 + 1/2B)·(1/b) → 0.25/b
• For the Octane, the miss ratio is roughly 0.06, independent of problem size.

Putting it all together for the SGI Octane
[Figure: miss ratio predictions for MVM point and blocked codes — the point code rises from 0.06 to 0.12 near N = 2000; the blocked code stays near 0.06]
We have assumed a fully associative cache. Conflict misses have the effect of reducing the effective cache size, so the transition from the large to the small cache model should happen sooner than predicted.

Experimental Results on the SGI Octane
[Figure: MVM L1-cache miss rates for sizes 100–2900 — original code vs. blocked code with B=50; miss rates between 0 and 0.12]
Predictions agree reasonably well with experiments.

Key transformations
• Loop permutation:
    for j = 1, N                          for i = 1, N
      for i = 1, N                =>        for j = 1, N
        y(i) = y(i) + A(i,j)*x(j)             y(i) = y(i) + A(i,j)*x(j)
• Loop tiling:
    for i = 1, N                          for bi = 1, N/B
      for j = 1, N                =>        for bj = 1, N/B
        y(i) = y(i) + A(i,j)*x(j)             for i = (bi-1)*B + 1, bi*B
                                                for j = (bj-1)*B + 1, bj*B
                                                  y(i) = y(i) + A(i,j)*x(j)
• Warning: these transformations may be illegal in some codes.

Matrix-matrix Product
[Figure: C = A·B, with result indices i, j and inner index k]
Code:
  for i = 1,N
    for j = 1,N
      for k = 1,N
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
Cache model: assume the cache line size is b fp's.

[Figure: one b-wide cache line of C, the corresponding row of A, and b columns of B]
Small cache:
Misses for each cache line of C (b elements):
• matrix A: b·(N/b) = N (the row's N/b lines are reloaded for each of the b elements)
• matrix B: b·N (each element streams a column whose N entries lie on N distinct lines)
• matrix C: b
• Total number of misses per cache line of C = N(b+1) + b
Total number of misses = (N²/b)·(N(b+1) + b) → N³(b+1)/b
Total number of references = 4N³
Miss ratio → 0.25·(b+1)/b

Large cache:
Cold misses = 3N²/b
Miss ratio = 3N²/4bN³ = 0.75/bN
For the large cache model, the miss ratio decreases as the size of the problem increases!
Intuition: there is a lot of data reuse, so once the matrices all fit into cache, the code goes full ...
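The reuse-distance definition above translates directly into code. Here is a small sketch of our own (the function name and the stream encoding are not from the notes): under a fully associative LRU cache of capacity c lines, the second of two references to the same line hits exactly when their reuse distance is less than c.

```python
def reuse_distance(stream, p1, p2):
    """Number of distinct cache lines referenced strictly between
    positions p1 and p2 of a memory reference stream."""
    return len(set(stream[p1 + 1:p2]))

# Between the two references to line 'a', the distinct lines are
# 'b' and 'c', so the reuse distance is 2: under LRU, the second
# reference to 'a' hits iff the cache holds more than 2 lines.
stream = ['a', 'b', 'c', 'b', 'a']
```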
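The Scenario I predictions are easy to sanity-check with a tiny fully associative LRU simulator. This is our own sketch (none of the names below come from the notes); the cache line size is 1 fp, so each array element is its own line.

```python
from collections import OrderedDict

def lru_miss_ratio(trace, capacity):
    """Simulate a fully associative LRU cache; capacity is in cache lines."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: make most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(trace)

def mvm_trace(N):
    """Reference stream of the i-j MVM loop: y(i), A(i,j), x(j), y(i)."""
    trace = []
    for i in range(N):
        for j in range(N):
            trace += [('y', i), ('A', i, j), ('x', j), ('y', i)]
    return trace

N = 100
small = lru_miss_ratio(mvm_trace(N), N // 2)  # fewer than 2N+1 numbers
large = lru_miss_ratio(mvm_trace(N), 3 * N)   # more than 2N+1 numbers
```

For N = 100 the small-cache run reproduces (2N² + N)/4N² = 0.5025 and the large-cache run (N² + 2N)/4N² = 0.255, matching the 0.5 and 0.25 limits above up to the 1/N terms.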
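With b > 1 the same simulation style separates Scenarios III and IV. A sketch of ours, under two assumptions we add here: each array starts on a cache-line boundary, and A is row-major (as the notes assume).

```python
from collections import OrderedDict

def lru_miss_ratio(trace, capacity):
    """Fully associative LRU cache; capacity is in cache lines."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses / len(trace)

def mvm_lines(N, b, order):
    """MVM reference stream in cache-line units (b fp's per line).
    Assumes line-aligned arrays and row-major A."""
    pairs = ([(i, j) for i in range(N) for j in range(N)] if order == 'ij'
             else [(i, j) for j in range(N) for i in range(N)])
    trace = []
    for i, j in pairs:
        trace += [('y', i // b), ('A', (i * N + j) // b),
                  ('x', j // b), ('y', i // b)]
    return trace

N, b = 128, 4
cap_small = (2 * N) // b   # 256 fp's: below the 2N+b transition for both orders
cap_large = (6 * N) // b   # 768 fp's: above bN+N+b, large-cache regime

ij_small = lru_miss_ratio(mvm_lines(N, b, 'ij'), cap_small)
ji_small = lru_miss_ratio(mvm_lines(N, b, 'ji'), cap_small)
ij_large = lru_miss_ratio(mvm_lines(N, b, 'ij'), cap_large)
```

The i-j order lands at (1/2 + 1/4N)/b ≈ 0.125 in the small-cache regime, while the j-i order pays for striding across rows: 0.25(1 + 1/b) + 1/4Nb ≈ 0.313. Both orders meet at roughly 1/4b once the cache is large.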
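The warning about legality deserves a concrete check: for MVM, both key transformations are legal because every loop order accumulates the same products into y(i) — and for these particular orders each y(i) even sees its products in the same j order, so the results match exactly. A quick check of our own:

```python
import random

def mvm_ij(A, x, N):
    y = [0.0] * N
    for i in range(N):
        for j in range(N):
            y[i] += A[i][j] * x[j]
    return y

def mvm_ji(A, x, N):           # loop permutation
    y = [0.0] * N
    for j in range(N):
        for i in range(N):
            y[i] += A[i][j] * x[j]
    return y

def mvm_blocked(A, x, N, B):   # loop tiling (N divisible by B)
    y = [0.0] * N
    for bi in range(0, N, B):
        for bj in range(0, N, B):
            for i in range(bi, bi + B):
                for j in range(bj, bj + B):
                    y[i] += A[i][j] * x[j]
    return y

random.seed(612)
N, B = 64, 16
A = [[random.random() for _ in range(N)] for _ in range(N)]
x = [random.random() for _ in range(N)]
```

In codes where different iterations write the same location in a conflicting order, the same transformations can change the answer — that is what the warning is about.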
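The MMM analysis can be checked the same way. This is our own sketch (line-aligned, row-major matrices; reference order C, A, B, C per innermost iteration, giving the 4N³ references counted above). A cache big enough for all three matrices sees only the 3N²/b cold misses, reproducing the 0.75/bN ratio; a tiny cache lands near the 0.25(b+1)/b limit.

```python
from collections import OrderedDict

def lru_miss_ratio(trace, capacity):
    """Fully associative LRU cache; capacity is in cache lines."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses / len(trace)

def mmm_lines(N, b):
    """i-j-k MMM stream in cache-line units: C(i,j), A(i,k), B(k,j), C(i,j).
    Assumes line-aligned, row-major matrices."""
    trace = []
    for i in range(N):
        for j in range(N):
            for k in range(N):
                trace += [('C', (i * N + j) // b), ('A', (i * N + k) // b),
                          ('B', (k * N + j) // b), ('C', (i * N + j) // b)]
    return trace

N, b = 32, 4
large = lru_miss_ratio(mmm_lines(N, b), 3 * N * N // b)  # all matrices fit
small = lru_miss_ratio(mmm_lines(N, b), 16)  # 64 fp's: far too small for a B column
```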

