CORNELL CS 612 - Lecture Notes
Homework #1
• http://www.cs.cornell.edu/Courses/cs612/2002SP/assignments/ps1/ps1.htm
• Due Tuesday, February 14th.

Cache Models and Program Transformations

Goal of this lecture
• We have looked at computational science applications, and isolated key kernels (MVM, MMM, linear system solvers, ...).
• We have studied caches and virtual memory, and we understand what causes cache misses (cold, capacity, conflict).
• Let us look at how to make some of the kernels run well on machines with caches.

Matrix-vector Product
[Figure: matrix A with row index i and column index j, vectors x and y]
Code:
  for i = 1,N
    for j = 1,N
      y(i) = y(i) + A(i,j)*x(j)
Total number of references = 4N²
We want to study two questions.
• Can we predict the miss ratio of different variations of this program for different cache models?
• What transformations can we do to improve performance? That is, how do we improve the miss ratio?

Reuse Distance
If r1 and r2 are two references to the same cache line in some memory stream, reuseDistance(r1,r2) is the number of distinct cache lines referenced between r1 and r2.
Cache model:
• fully associative cache (so no conflict misses)
• LRU replacement strategy
• We will look at two extremes:
  - large cache model: no capacity misses
  - small cache model: a reference misses whenever its reuse distance is of the order of the problem size (the size of the arrays)

Scenario I
[Figure: matrix A with indices i, j; vectors x and y]
Cache model:
• fully associative cache (no conflict misses)
• LRU replacement strategy
• cache line size = 1 floating-point number
Small cache: assume the cache can hold fewer than (2N+1) numbers.
Misses:
• matrix A: N² cold misses
• vector x: N cold misses + N(N−1) capacity misses
• vector y: N cold misses
• Miss ratio = (2N² + N)/4N² → 0.5

Large cache model: the cache can hold more than (2N+1) numbers.
Misses:
• matrix A: N² cold misses
• vector x: N cold misses
• vector y: N cold misses
• Miss ratio = (N² + 2N)/4N² → 0.25
[Figure: miss ratio vs. N — 0.25 in the large-cache regime, 0.50 in the small-cache regime, transition near N = c/2, where c = size of cache in # of fp's]

Scenario II
[Figure: matrix A with indices i, j; vectors x and y]
Same cache model as Scenario I, but different code.
Code: walk matrix A by columns
  for j = 1,N
    for i = 1,N
      // SAXPY
      y(i) = y(i) + A(i,j)*x(j)
It is easy to show that the miss ratios are identical to Scenario I.

Scenario III
[Figure: matrix A with indices i, j; vectors x and y]
Cache model:
• fully associative cache (no conflict misses)
• LRU replacement strategy
• cache line size = b floating-point numbers (can exploit spatial locality)
Code: (original) i-j loop order
  for i = 1,N
    for j = 1,N
      y(i) = y(i) + A(i,j)*x(j)
Let us assume A is stored in row-major order.

Small cache:
Misses:
• matrix A: N²/b cold misses
• vector x: N/b cold misses + N(N−1)/b capacity misses
• vector y: N/b cold misses
• Miss ratio = (1/2 + 1/4N)·(1/b) → 1/2b
Large cache:
Misses:
• matrix A: N²/b cold misses
• vector x: N/b cold misses
• vector y: N/b cold misses
• Miss ratio = (1/4 + 1/2N)·(1/b) → 1/4b
The transition from the small cache to the large cache model occurs when 2N + b = c.

[Figure: miss ratios for Scenario III — 1/2b in the small-cache regime, 1/4b in the large-cache regime, transition near N = c/2; b = number of fp's in one cache line]
Let us plug in some numbers for the SGI Octane:
• Line size = 32 bytes ⇒ b = 4
• Cache size = 32 KB ⇒ c = 4K
• Large cache miss ratio = 1/16 = 0.06
• Small cache miss ratio = 0.12
• Small/large transition size: N = 2000

Scenario IV
[Figure: matrix A with indices i, j; vectors x and y]
Cache model:
• fully associative cache (no conflict misses)
• LRU replacement strategy
• cache line size = b floating-point numbers (can exploit spatial locality)
Code: j-i loop order
  for j = 1,N
    for i = 1,N
      y(i) = y(i) + A(i,j)*x(j)
Note: we are not walking over A in memory layout order.

Small cache:
Misses:
• matrix A: N² misses (N²/b cold + N²(b−1)/b capacity: each line is evicted before its b−1 reuses)
• vector x: N/b cold misses
• vector y: N/b cold misses + N(N−1)/b capacity misses
• Miss ratio = 0.25·(1 + 1/b) + 1/4Nb → 0.25·(1 + 1/b)
Large cache:
Misses:
• matrix A: N²/b cold misses
• vector x: N/b cold misses
• vector y: N/b cold misses
• Miss ratio = (1/4 + 1/2N)·(1/b) → 1/4b
The transition from small cache to large cache occurs when c > bN + N + b.

[Figure: miss ratios for Scenario IV — 0.25(1 + 1/b) in the small-cache regime, 0.25/b in the large-cache regime, transition near N = c/(b+1); b = number of fp's in one cache line]
Let us plug in some numbers for the SGI Octane:
• Line size = 32 bytes ⇒ b = 4
• Cache size = 32 KB ⇒ c = 4K
• Large cache miss ratio = 1/16 = 0.06
• Small cache miss ratio = 0.31
• Small/large transition size: N = 800

Scenario V: Blocked Code
[Figure: matrix A tiled into B×B blocks, with vectors x and y]
Code:
  for bi = 1, N/B
    for bj = 1, N/B
      for i = (bi-1)*B + 1, bi*B
        for j = (bj-1)*B + 1, bj*B
          y(i) = y(i) + A(i,j)*x(j)

• Pick the block size B so that you effectively have the large cache model while executing the code within a block (2B = c).
• Misses within a block:
  - matrix A: B²/b cold misses
  - vector x: B/b
  - vector y: B/b
• Total number of block computations = (N/B)²
• Miss ratio = (0.25 + 1/2B)·(1/b) → 0.25/b
• For the Octane, the miss ratio is roughly 0.06, independent of problem size.

Putting it all together for the SGI Octane
[Figure: miss ratio predictions for MVM point and blocked codes — the point code rises from 0.06 to 0.12 near N = 2000; the blocked code stays near 0.06]
We have assumed a fully associative cache. Conflict misses have the effect of reducing the effective cache size, so the transition from the large to the small cache model should happen sooner than predicted.

Experimental Results on the SGI Octane
[Figure: MVM L1-cache miss rates for sizes 100–2900 — original code vs. blocked code with B=50; miss rates between 0 and 0.12]
Predictions agree reasonably well with experiments.

Key transformations
• Loop permutation:
    for j = 1, N                          for i = 1, N
      for i = 1, N                =>        for j = 1, N
        y(i) = y(i) + A(i,j)*x(j)             y(i) = y(i) + A(i,j)*x(j)
• Loop tiling:
    for i = 1, N                          for bi = 1, N/B
      for j = 1, N                =>        for bj = 1, N/B
        y(i) = y(i) + A(i,j)*x(j)             for i = (bi-1)*B + 1, bi*B
                                                for j = (bj-1)*B + 1, bj*B
                                                  y(i) = y(i) + A(i,j)*x(j)
• Warning: these transformations may be illegal in some codes.

Matrix-matrix Product
[Figure: C = A·B, with result indices i, j and inner index k]
Code:
  for i = 1,N
    for j = 1,N
      for k = 1,N
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
Cache model: assume the cache line size is b fp's.

[Figure: one b-wide cache line of C, the corresponding row of A, and b columns of B]
Small cache:
Misses for each cache line of C (b elements):
• matrix A: b·(N/b) = N (the row's N/b lines are reloaded for each of the b elements)
• matrix B: b·N (each element streams a column whose N entries lie on N distinct lines)
• matrix C: b
• Total number of misses per cache line of C = N(b+1) + b
Total number of misses = (N²/b)·(N(b+1) + b) → N³(b+1)/b
Total number of references = 4N³
Miss ratio → 0.25·(b+1)/b

Large cache:
Cold misses = 3N²/b
Miss ratio = 3N²/4bN³ = 0.75/bN
For the large cache model, the miss ratio decreases as the size of the problem increases!
Intuition: there is a lot of data reuse, so once the matrices all fit into cache, the code goes full ...
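The reuse-distance definition above translates directly into code. Here is a small sketch of our own (the function name and the stream encoding are not from the notes): under a fully associative LRU cache of capacity c lines, the second of two references to the same line hits exactly when their reuse distance is less than c.

```python
def reuse_distance(stream, p1, p2):
    """Number of distinct cache lines referenced strictly between
    positions p1 and p2 of a memory reference stream."""
    return len(set(stream[p1 + 1:p2]))

# Between the two references to line 'a', the distinct lines are
# 'b' and 'c', so the reuse distance is 2: under LRU, the second
# reference to 'a' hits iff the cache holds more than 2 lines.
stream = ['a', 'b', 'c', 'b', 'a']
```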
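The Scenario I predictions are easy to sanity-check with a tiny fully associative LRU simulator. This is our own sketch (none of the names below come from the notes); the cache line size is 1 fp, so each array element is its own line.

```python
from collections import OrderedDict

def lru_miss_ratio(trace, capacity):
    """Simulate a fully associative LRU cache; capacity is in cache lines."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: make most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(trace)

def mvm_trace(N):
    """Reference stream of the i-j MVM loop: y(i), A(i,j), x(j), y(i)."""
    trace = []
    for i in range(N):
        for j in range(N):
            trace += [('y', i), ('A', i, j), ('x', j), ('y', i)]
    return trace

N = 100
small = lru_miss_ratio(mvm_trace(N), N // 2)  # fewer than 2N+1 numbers
large = lru_miss_ratio(mvm_trace(N), 3 * N)   # more than 2N+1 numbers
```

For N = 100 the small-cache run reproduces (2N² + N)/4N² = 0.5025 and the large-cache run (N² + 2N)/4N² = 0.255, matching the 0.5 and 0.25 limits above up to the 1/N terms.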
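With b > 1 the same simulation style separates Scenarios III and IV. A sketch of ours, under two assumptions we add here: each array starts on a cache-line boundary, and A is row-major (as the notes assume).

```python
from collections import OrderedDict

def lru_miss_ratio(trace, capacity):
    """Fully associative LRU cache; capacity is in cache lines."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses / len(trace)

def mvm_lines(N, b, order):
    """MVM reference stream in cache-line units (b fp's per line).
    Assumes line-aligned arrays and row-major A."""
    pairs = ([(i, j) for i in range(N) for j in range(N)] if order == 'ij'
             else [(i, j) for j in range(N) for i in range(N)])
    trace = []
    for i, j in pairs:
        trace += [('y', i // b), ('A', (i * N + j) // b),
                  ('x', j // b), ('y', i // b)]
    return trace

N, b = 128, 4
cap_small = (2 * N) // b   # 256 fp's: below the 2N+b transition for both orders
cap_large = (6 * N) // b   # 768 fp's: above bN+N+b, large-cache regime

ij_small = lru_miss_ratio(mvm_lines(N, b, 'ij'), cap_small)
ji_small = lru_miss_ratio(mvm_lines(N, b, 'ji'), cap_small)
ij_large = lru_miss_ratio(mvm_lines(N, b, 'ij'), cap_large)
```

The i-j order lands at (1/2 + 1/4N)/b ≈ 0.125 in the small-cache regime, while the j-i order pays for striding across rows: 0.25(1 + 1/b) + 1/4Nb ≈ 0.313. Both orders meet at roughly 1/4b once the cache is large.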
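The warning about legality deserves a concrete check: for MVM, both key transformations are legal because every loop order accumulates the same products into y(i) — and for these particular orders each y(i) even sees its products in the same j order, so the results match exactly. A quick check of our own:

```python
import random

def mvm_ij(A, x, N):
    y = [0.0] * N
    for i in range(N):
        for j in range(N):
            y[i] += A[i][j] * x[j]
    return y

def mvm_ji(A, x, N):           # loop permutation
    y = [0.0] * N
    for j in range(N):
        for i in range(N):
            y[i] += A[i][j] * x[j]
    return y

def mvm_blocked(A, x, N, B):   # loop tiling (N divisible by B)
    y = [0.0] * N
    for bi in range(0, N, B):
        for bj in range(0, N, B):
            for i in range(bi, bi + B):
                for j in range(bj, bj + B):
                    y[i] += A[i][j] * x[j]
    return y

random.seed(612)
N, B = 64, 16
A = [[random.random() for _ in range(N)] for _ in range(N)]
x = [random.random() for _ in range(N)]
```

In codes where different iterations write the same location in a conflicting order, the same transformations can change the answer — that is what the warning is about.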
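The MMM analysis can be checked the same way. This is our own sketch (line-aligned, row-major matrices; reference order C, A, B, C per innermost iteration, giving the 4N³ references counted above). A cache big enough for all three matrices sees only the 3N²/b cold misses, reproducing the 0.75/bN ratio; a tiny cache lands near the 0.25(b+1)/b limit.

```python
from collections import OrderedDict

def lru_miss_ratio(trace, capacity):
    """Fully associative LRU cache; capacity is in cache lines."""
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses / len(trace)

def mmm_lines(N, b):
    """i-j-k MMM stream in cache-line units: C(i,j), A(i,k), B(k,j), C(i,j).
    Assumes line-aligned, row-major matrices."""
    trace = []
    for i in range(N):
        for j in range(N):
            for k in range(N):
                trace += [('C', (i * N + j) // b), ('A', (i * N + k) // b),
                          ('B', (k * N + j) // b), ('C', (i * N + j) // b)]
    return trace

N, b = 32, 4
large = lru_miss_ratio(mmm_lines(N, b), 3 * N * N // b)  # all matrices fit
small = lru_miss_ratio(mmm_lines(N, b), 16)  # 64 fp's: far too small for a B column
```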

