UT CS 378 - Lecture Notes


DGEMM Implementation
Kazushige Goto <[email protected]>

Contents
- Cache organization
- How the dgemm kernel works
- Packing

Cache Organization

The memory hierarchy, from the CPU outward: registers, L1 cache, L2 cache, main memory. Generally, cache improves performance.

Three keywords of cache (L1)
- Size: pretty small (8 kB - 64 kB)
- Bandwidth: how much data it can move per cycle; very wide
- Latency: the response time to get data; relatively low

Size

The L1 cache is very small compared to main memory.

Bandwidth

L1 bandwidth is much wider than memory bandwidth (over 20 times?). Memory bandwidth is one way at a time (alternating); L1 is very wide.

Latency

L1 is really close, L2 is near, and main memory is far, far away.

Packing Algorithm

C (m x n) += A (m x k) x B (k x n), processed in blocks: Bm x Bk blocks of A and Bk x Bn blocks of B.
- A: transposed copy
- B: non-transposed copy
- These copies would be a bottleneck on small matrices.

Blocking on the L1 cache

C' += A' x B', where A' (Bm x Bk) and B' (Bk x Bn) are blocks sized to fit the L1 cache.

What's the problem?
- The kernel itself may perform at 100%.
- The blocking size is Bm = Bk = 64 ~ 80.
- Because the blocking size is so small, the copy overhead is large.
- Copy overhead is about 20% of total computation time.
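The two packing copies described above (A transposed, B non-transposed) can be sketched as follows. This is a minimal illustration assuming column-major storage with a leading dimension, as in BLAS; the function names `pack_a` / `pack_b` and the packed layouts are hypothetical stand-ins, not the actual GotoBLAS code.

```c
#include <stddef.h>

/* A: transposed copy -- a Bm x Bk block of A is stored so the kernel
   can walk it contiguously along k (one row of A after another). */
static void pack_a(int Bm, int Bk, const double *A, int lda, double *Ap)
{
    for (int i = 0; i < Bm; i++)
        for (int p = 0; p < Bk; p++)
            *Ap++ = A[i + (size_t)p * lda];   /* A(i,p), transposed */
}

/* B: non-transposed copy -- a Bk x Bn block of B is made contiguous
   column by column, removing the leading-dimension stride. */
static void pack_b(int Bk, int Bn, const double *B, int ldb, double *Bp)
{
    for (int j = 0; j < Bn; j++)
        for (int p = 0; p < Bk; p++)
            *Bp++ = B[p + (size_t)j * ldb];   /* B(p,j) */
}
```

Each copy touches every element exactly once, which is where the copy overhead quoted above comes from.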
With 20% copy overhead, total performance will be only about 80% of peak.

Increasing the Blocking Size

Solutions:
- Increase the blocking size: Bm = Bk >= 256.
- Copy overhead then falls below 1% of total computation.
- But A' is now larger than the L1 cache.

DGEMM kernel (1) -- START --

All original data starts out in main memory, behind the L2 cache, the L1 cache, and the registers.

DGEMM kernel (2) -- Copying for B --

B is packed into B'. Whatever data was resident in the caches beforehand is useless.

DGEMM kernel (3) -- Copying for A --

A is packed into A', with a blocking size of half the L2 cache, so A' stays resident in L2. Again, previously resident data is useless.

DGEMM kernel -- two streams operation --

During the multiply/add loop there are two data streams: A' remains resident in L2, while B' is always being replaced -- B's data has to be brought into the L1 cache on every pass.

DGEMM kernel (4) -- if m is small --

When m is small the roles flip: B is packed into B' with a blocking size of half the L1 cache, so B' remains resident in L1, while A is the stream that is always being replaced.

Why do we pack data?
- The actual memory locations of a matrix are not contiguous:
  - virtual memory mapping,
  - row- or column-major storage and the leading dimension,
  - cache line size / associativity.
- Packing solves the problems above.
- But the copy (packing) overhead is a headache.

Matrix in memory

A column-major matrix with a leading dimension is laid out in memory as strided columns, which causes:
- TLB misses,
- use of only a part of each cache line,
- cache conflicts, depending on the leading dimension.

Packing will
- reduce TLB misses,
- increase the effective cache size,
- reduce the required bandwidth,
- help hardware/software prefetch work effectively,
at the cost of
- extra buffer space,
- copy overhead.

Intel Core2 (1.85 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO and MKL 9.0; y-axis 0 to 7400 MFlops.]

AMD Opteron (2.2 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO and ACML 3.5.0; y-axis 0 to 4800 MFlops.]

Core2 x 8 (2.66 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO, MKL 9.0, and ATLAS 3.7.21; y-axis 0 to 59200 MFlops.]

Opteron x 8 (2.2 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO and ACML; y-axis 0 to 38400 MFlops.]
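Putting the slides together, the overall blocked loop might look like the sketch below: pack a panel of B, pack a transposed block of A, and run a kernel over the packed buffers. The block sizes, names, and the simple scalar micro-kernel here are illustrative only; the real GotoBLAS kernel is hand-tuned assembly with block sizes chosen per CPU.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative blocked dgemm: C += A*B, all matrices n x n, column-major.
   BM/BK and the scalar micro-kernel are hypothetical stand-ins for the
   tuned values and assembly kernel discussed in the slides. */
enum { BM = 64, BK = 64 };

static void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    double *Ap = malloc((size_t)BM * BK * sizeof *Ap); /* packed A' (transposed) */
    double *Bp = malloc((size_t)BK * n  * sizeof *Bp); /* packed B' panel */

    for (int p0 = 0; p0 < n; p0 += BK) {
        int kb = (n - p0 < BK) ? n - p0 : BK;
        /* pack a kb x n panel of B, column by column (non-transposed) */
        for (int j = 0; j < n; j++)
            memcpy(Bp + (size_t)j * kb, B + p0 + (size_t)j * n,
                   (size_t)kb * sizeof *Bp);
        for (int i0 = 0; i0 < n; i0 += BM) {
            int mb = (n - i0 < BM) ? n - i0 : BM;
            /* pack an mb x kb block of A, transposed copy */
            for (int i = 0; i < mb; i++)
                for (int p = 0; p < kb; p++)
                    Ap[(size_t)i * kb + p] = A[i0 + i + (size_t)(p0 + p) * n];
            /* micro-kernel: both packed operands are walked contiguously */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < mb; i++) {
                    double s = 0.0;
                    for (int p = 0; p < kb; p++)
                        s += Ap[(size_t)i * kb + p] * Bp[(size_t)j * kb + p];
                    C[i0 + i + (size_t)j * n] += s;
                }
        }
    }
    free(Ap);
    free(Bp);
}
```

Note how the inner loop reads both `Ap` and `Bp` with stride 1, regardless of the original leading dimension -- this is exactly the contiguity, cache-line, and TLB benefit the packing slides argue for.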

