UT CS 378 - Lecture Notes


DGEMM Implementation
Kazushige Goto <[email protected]>

Contents
- Cache organization
- How the dgemm kernel works
- Packing

Cache Organization

The memory hierarchy, from the CPU outward: registers, L1 cache, L2 cache, main memory. Generally, cache improves performance.

Three keywords of cache (L1)
- Size: pretty small (8 kB - 64 kB)
- Bandwidth: how much data it can move per cycle; very wide
- Latency: the response time to get data; relatively low

Size

The L1 cache is very small compared to main memory.

Bandwidth

L1 bandwidth is much wider than memory bandwidth (over 20 times?). Memory bandwidth is one way at a time (alternating); L1 is very wide.

Latency

L1 is really close, L2 is near, and main memory is far, far away.

Packing Algorithm

C (m x n) += A (m x k) x B (k x n), processed in blocks: Bm x Bk blocks of A and Bk x Bn blocks of B.
- A: transposed copy
- B: non-transposed copy
- These copies would be a bottleneck on small matrices.

Blocking on the L1 cache

C' += A' x B', where A' (Bm x Bk) and B' (Bk x Bn) are blocks sized to fit the L1 cache.

What's the problem?
- The kernel itself may perform at 100%.
- The blocking size is Bm = Bk = 64 ~ 80.
- Because the blocking size is so small, the copy overhead is large.
- Copy overhead is about 20% of total computation time.
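The two packing copies described above (A transposed, B non-transposed) can be sketched as follows. This is a minimal illustration assuming column-major storage with a leading dimension, as in BLAS; the function names `pack_a` / `pack_b` and the packed layouts are hypothetical stand-ins, not the actual GotoBLAS code.

```c
#include <stddef.h>

/* A: transposed copy -- a Bm x Bk block of A is stored so the kernel
   can walk it contiguously along k (one row of A after another). */
static void pack_a(int Bm, int Bk, const double *A, int lda, double *Ap)
{
    for (int i = 0; i < Bm; i++)
        for (int p = 0; p < Bk; p++)
            *Ap++ = A[i + (size_t)p * lda];   /* A(i,p), transposed */
}

/* B: non-transposed copy -- a Bk x Bn block of B is made contiguous
   column by column, removing the leading-dimension stride. */
static void pack_b(int Bk, int Bn, const double *B, int ldb, double *Bp)
{
    for (int j = 0; j < Bn; j++)
        for (int p = 0; p < Bk; p++)
            *Bp++ = B[p + (size_t)j * ldb];   /* B(p,j) */
}
```

Each copy touches every element exactly once, which is where the copy overhead quoted above comes from.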
With 20% copy overhead, total performance will be only about 80% of peak.

Increasing the Blocking Size

Solutions:
- Increase the blocking size: Bm = Bk >= 256.
- Copy overhead then falls below 1% of total computation.
- But A' is now larger than the L1 cache.

DGEMM kernel (1) -- START --

All original data starts out in main memory, behind the L2 cache, the L1 cache, and the registers.

DGEMM kernel (2) -- Copying for B --

B is packed into B'. Whatever data was resident in the caches beforehand is useless.

DGEMM kernel (3) -- Copying for A --

A is packed into A', with a blocking size of half the L2 cache, so A' stays resident in L2. Again, previously resident data is useless.

DGEMM kernel -- two streams operation --

During the multiply/add loop there are two data streams: A' remains resident in L2, while B' is always being replaced -- B's data has to be brought into the L1 cache on every pass.

DGEMM kernel (4) -- if m is small --

When m is small the roles flip: B is packed into B' with a blocking size of half the L1 cache, so B' remains resident in L1, while A is the stream that is always being replaced.

Why do we pack data?
- The actual memory locations of a matrix are not contiguous:
  - virtual memory mapping,
  - row- or column-major storage and the leading dimension,
  - cache line size / associativity.
- Packing solves the problems above.
- But the copy (packing) overhead is a headache.

Matrix in memory

A column-major matrix with a leading dimension is laid out in memory as strided columns, which causes:
- TLB misses,
- use of only a part of each cache line,
- cache conflicts, depending on the leading dimension.

Packing will
- reduce TLB misses,
- increase the effective cache size,
- reduce the required bandwidth,
- help hardware/software prefetch work effectively,
at the cost of
- extra buffer space,
- copy overhead.

Intel Core2 (1.85 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO and MKL 9.0; y-axis 0 to 7400 MFlops.]

AMD Opteron (2.2 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO and ACML 3.5.0; y-axis 0 to 4800 MFlops.]

Core2 x 8 (2.66 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO, MKL 9.0, and ATLAS 3.7.21; y-axis 0 to 59200 MFlops.]

Opteron x 8 (2.2 GHz) performance
[Chart: MFlops vs. m = n = k from 0 to 2000, comparing GOTO and ACML; y-axis 0 to 38400 MFlops.]
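Putting the slides together, the overall blocked loop might look like the sketch below: pack a panel of B, pack a transposed block of A, and run a kernel over the packed buffers. The block sizes, names, and the simple scalar micro-kernel here are illustrative only; the real GotoBLAS kernel is hand-tuned assembly with block sizes chosen per CPU.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative blocked dgemm: C += A*B, all matrices n x n, column-major.
   BM/BK and the scalar micro-kernel are hypothetical stand-ins for the
   tuned values and assembly kernel discussed in the slides. */
enum { BM = 64, BK = 64 };

static void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    double *Ap = malloc((size_t)BM * BK * sizeof *Ap); /* packed A' (transposed) */
    double *Bp = malloc((size_t)BK * n  * sizeof *Bp); /* packed B' panel */

    for (int p0 = 0; p0 < n; p0 += BK) {
        int kb = (n - p0 < BK) ? n - p0 : BK;
        /* pack a kb x n panel of B, column by column (non-transposed) */
        for (int j = 0; j < n; j++)
            memcpy(Bp + (size_t)j * kb, B + p0 + (size_t)j * n,
                   (size_t)kb * sizeof *Bp);
        for (int i0 = 0; i0 < n; i0 += BM) {
            int mb = (n - i0 < BM) ? n - i0 : BM;
            /* pack an mb x kb block of A, transposed copy */
            for (int i = 0; i < mb; i++)
                for (int p = 0; p < kb; p++)
                    Ap[(size_t)i * kb + p] = A[i0 + i + (size_t)(p0 + p) * n];
            /* micro-kernel: both packed operands are walked contiguously */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < mb; i++) {
                    double s = 0.0;
                    for (int p = 0; p < kb; p++)
                        s += Ap[(size_t)i * kb + p] * Bp[(size_t)j * kb + p];
                    C[i0 + i + (size_t)j * n] += s;
                }
        }
    }
    free(Ap);
    free(Bp);
}
```

Note how the inner loop reads both `Ap` and `Bp` with stride 1, regardless of the original leading dimension -- this is exactly the contiguity, cache-line, and TLB benefit the packing slides argue for.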

