Matrix Multiply Implementation

Kazushige Goto <[email protected]>

Contents
• Cache organization
• Packing data
• How the MM (DGEMM) kernel works

Cache Organization
[diagram: CPU - L1 cache - L2 cache - Memory]
Generally, cache improves performance.

Three keywords of cache (L1)
• Size
  - Pretty small (8 kB - 64 kB)
• Bandwidth
  - How much data it can move per cycle
  - Very wide
• Latency
  - Response time to get data
  - Relatively low

Size
[diagram: CPU - L1 cache - L2 cache - Memory]
Cache size is very small!

Bandwidth
[diagram: CPU - L1 cache - L2 cache - Memory; the L1 path is very wide, the memory path is one way, alternating]
L1 bandwidth is much wider than memory bandwidth (over 20 times?).

Latency
[diagram annotations: L1 "really close", L2 "near", memory "far and far away"]

Why do we pack data?
• Actual memory locations are not contiguous
  - Virtual memory mapping
  - Row or column major, leading dimension
  - Cache line size / associativity
• Packing solves the problems above
• Copy (packing) overhead is a headache

Matrix in memory
[diagram: column-major layout, leading dimension, cache lines]
• TLB misses
• Only a part of each cache line is used
• Cache conflicts may occur, depending on the leading dimension

Packing will
• Reduce TLB misses
• Increase the effective cache size
• Reduce the required bandwidth
• Help hardware/software prefetch to work effectively
• ...but it needs extra buffer space
• ...and adds copy overhead

Packing Algorithm
[diagram: C (m x n) += A (m x k) x B (k x n), partitioned into Bm x Bk blocks of A and Bk x Bn blocks of B]
• A: transposed copy
• B: non-transposed copy
• Packing would be the bottleneck on small matrices

Blocking on L1 cache
[diagram: C' = A' x B', with a Bm x Bk block A' and a Bk x Bn block B']
What's the problem?
• The kernel itself may perform at 100% of peak
• Blocking size is Bm = Bk = 64 - 80
• If the blocking size is too small, copy overhead is heavy
• With copy overhead at 20% of total computation time, total performance will be 80% of peak at most

Increasing Blocking Size
[diagram: same partitioning with larger blocks]
Solutions
• Bm, Bk >= 256
• Copy overhead drops to less than 1% of the total computation
• Streaming the data of A (resident in L2) is the key

DGEMM kernel (1) -- START --
[diagram: registers - L1 cache - L2 cache - main memory; the original data sits in main memory]

DGEMM kernel (2) -- Copying for B --
[diagram: B is copied from main memory into the packed buffer B', streaming through L2/L1 as B'']
• Resident data is useless

DGEMM kernel (3) -- Copying for A --
[diagram: A is copied into the packed buffer A', which stays resident in L2]
• Resident data is useless
• Blocking size is half of L2

DGEMM kernel -- two streams operation --
[diagram: MUL/ADD in registers; A' in L2, B' streaming through L1]
• A' remains resident in L2
• The B' data in L1 is always being replaced!
• Data from B has to be brought into the L1 cache

Problem and Solution
• The operation works against the cache replacement policy (LRU)
  - B' is in the L1 cache; A' is always replaced
  - Use prefetch instructions to give a hint
• Bandwidth from L2 is relatively narrow
  - Control it by changing the unrolling type
• Latency from L2 is pretty long
  - Use software prefetch

Intel Core2 (2.66 GHz) performance
[chart: MFlops vs. matrix order 0-2000; series GOTO-1.23, MKL-10.0]

Core2 x 8 (2.66 GHz) performance
[chart: MFlops vs. matrix order 0-4000; series GOTO, MKL]

AMD Opteron (2.2 GHz) performance
[chart: MFlops vs. matrix order (m = n = k) 0-2000; series GOTO-1.23, ACML-4.0.1]

Opteron x 8 (2.2 GHz) performance
[chart: MFlops vs. m = n = k 0-2000; series GOTO, ACML 3.5.0]

AMD Barcelona (2 GHz) performance
[chart: MFlops vs. matrix order 0-2000; series GOTO, ACML]

Barcelona x 16 (2 GHz) performance
[chart: GFlops vs. matrix order 0-4000; series GOTO-1.23, ACML]
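The transposed and non-transposed copies described on the "Packing Algorithm" slide can be sketched in C. This is a minimal illustration under assumed names (`pack_a` and `pack_b` are not GotoBLAS identifiers), for column-major matrices with leading dimensions `lda` and `ldb`:

```c
#include <stddef.h>

/* Pack a Bm x Bk block of column-major A into a contiguous buffer,
 * transposed, so the kernel can stream it row by row. */
void pack_a(const double *a, size_t lda,
            size_t bm, size_t bk, double *buf)
{
    for (size_t i = 0; i < bm; i++)            /* row of the block    */
        for (size_t p = 0; p < bk; p++)        /* column of the block */
            buf[i * bk + p] = a[p * lda + i];  /* transposed copy     */
}

/* Pack a Bk x Bn block of column-major B into a contiguous buffer
 * without transposing (columns stay columns, now tightly packed). */
void pack_b(const double *b, size_t ldb,
            size_t bk, size_t bn, double *buf)
{
    for (size_t j = 0; j < bn; j++)
        for (size_t p = 0; p < bk; p++)
            buf[j * bk + p] = b[j * ldb + p];  /* non-transposed copy */
}
```

Because each packed buffer is contiguous, the kernel walks one page stream and uses full cache lines, which is what the "Packing will" slide promises; the price is the extra buffer space and the copy loops themselves.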
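The outer blocking from the "Increasing Blocking Size" slide (loop over Bk x Bm panels so a Bm, Bk >= 256 block of A can stay resident in L2 while B streams through L1) can be sketched as a plain C driver. The name `dgemm_blocked` and the naive triple-loop inner kernel are our own stand-ins; the real inner kernel operates on packed A'/B' copies with software prefetch:

```c
#include <stddef.h>

/* Sketch of the outer blocking: C += A * B, all column-major.
 * BM/BK mirror the slide's Bm, Bk >= 256; each (pp, ii) iteration
 * corresponds to one resident block of A and a streamed panel of B. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, size_t lda,
                   const double *B, size_t ldb,
                   double *C, size_t ldc)
{
    enum { BM = 256, BK = 256 };
    for (size_t pp = 0; pp < k; pp += BK) {
        size_t bk = (k - pp < BK) ? k - pp : BK;
        for (size_t ii = 0; ii < m; ii += BM) {
            size_t bm = (m - ii < BM) ? m - ii : BM;
            /* Inner "kernel": a naive stand-in for the packed,
             * register-blocked assembly kernel. */
            for (size_t j = 0; j < n; j++)
                for (size_t p = 0; p < bk; p++) {
                    double bjp = B[j * ldb + (pp + p)];
                    for (size_t i = 0; i < bm; i++)
                        C[j * ldc + (ii + i)] +=
                            A[(pp + p) * lda + (ii + i)] * bjp;
                }
        }
    }
}
```

With BM = BK = 256 the resident block of A is 256 * 256 * 8 bytes = 512 kB, which is roughly half of a typical L2 of that era, matching the "blocking size is half of L2" note on the kernel slides.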