Optimizing MMM & ATLAS Library Generator

Outline:
- Recall: MMM miss ratios
- IJK version (large cache)
- IJK version (small cache)
- MMM experiments
- How large can matrices be and still not suffer capacity misses?
- Check with experiment
- High-level picture of high-performance MMM code
- ATLAS
- Our approach
- BLAS
- Optimizations
- Cache-level blocking (tiling)
- Register-level blocking
- Scheduling
- Search Strategy
- Find Best NB
- Model-based optimization
- Modeling for Optimization Parameters
- Largest NB for no capacity/conflict misses
- Largest NB for no capacity misses
- Summary: Modeling for Tile Size (NB)
- Summary of model
- Experiments
- Some sensitivity graphs for Alpha 21264
- Eliminating performance gaps
- Small performance gap: Alpha 21264
- Large performance gap: Itanium
- Itanium diagnosis and solution
- Large performance gap: Opteron
- Opteron diagnosis and solution
- Refined model
- Bottom line
- Future Directions

Recall: MMM miss ratios
[Figure: L1 cache miss ratio for Intel Pentium III; MMM with N = 1…1300; 16KB cache, 32B blocks, 4-way associative, 8-byte elements]

IJK version (large cache)
  DO I = 1, N   // row-major storage
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
Large cache scenario:
- Matrices are small enough to fit into cache
- Only cold misses, no capacity misses
- Miss ratio: the data size is 3N² elements, each miss brings in b floating-point numbers, and the loop nest performs 4N³ memory accesses
- Miss ratio = (3N²/b) / (4N³) = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)

IJK version (small cache)
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
Small cache scenario:
- Matrices are large compared to the cache; the reuse distance is not O(1), so reuses miss
- Cold and capacity misses
- Miss counts: C: N²/b misses (good temporal locality); A: N³/b misses (good spatial locality); B: N³ misses (poor temporal and spatial locality)
- Miss ratio ≈ (N³/b + N³) / (4N³) = 0.25(b+1)/b = 0.3125 (for b = 4)

MMM experiments
[Figure: measured L1 cache miss ratio for Intel Pentium III; MMM with N = 1…1300; 16KB cache, 32B blocks, 4-way associative, 8-byte elements]
Can we predict this?
How large can matrices be and still not suffer capacity misses?
  DO I = 1, M
    DO J = 1, N
      DO K = 1, P
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
- Each iteration of the outermost loop walks over the entire B matrix, so all of B (N·P elements) must fit in cache
- We walk over the rows of A, and successive iterations of the middle loop touch the same row of A, so one row of A (P elements) must fit in cache
- We walk over the elements of C one at a time
- So the inequality is N·P + P + 1 ≤ C, where C is the cache capacity in elements

Check with experiment
- For our machine, the capacity of the L1 cache is 16KB / 8 bytes per double = 2^11 = 2048 doubles
- If the matrices are square, we must solve N² + N + 1 = 2^11, which gives N ≈ 45
- This agrees well with the experiment

High-level picture of high-performance MMM code
- Block the code for each level of the memory hierarchy: registers, L1 cache, …
- Choose the block sizes at each level using the theory described previously
- Useful optimization: choose the block size at level L+1 to be a multiple of the block size at level L

ATLAS
- Library generator for MMM and the other BLAS routines
- Blocks only for registers and the L1 cache
- Uses search to determine block sizes, rather than the analytical formulas we used; the search takes more time, but it is done once, when the library is produced
- Let us study the structure of ATLAS in a little more detail

Our approach
[Diagram: the original ATLAS infrastructure detects hardware parameters (NR, MulAdd, L*, L1Size) and feeds them to the ATLAS search engine (MMSearch), which drives the ATLAS MM code generator (MMCase) with candidate parameters (NB, MU, NU, KU, xFetch, MulAdd, Latency); each generated mini-MMM source is compiled, executed, and measured (MFLOPS), and the measurements steer the search. The model-based ATLAS infrastructure replaces the search engine with a model: the detected hardware parameters (NR, MulAdd, L*, L1Size, L1 I-cache size) go into the model, which computes the code-generator parameters directly.]

BLAS
Let us focus on MMM:
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < K; k++)
        C[i][j] += A[i][k]*B[k][j];
Properties:
- Very good reuse: O(N²) data, O(N³) computation
- Many optimization opportunities, few "real" dependences
- Will run poorly on modern machines: poor use of cache and registers, poor use of processor pipelines
Optimizations
- Cache-level blocking (tiling): ATLAS blocks only for the L1 cache; NB is the L1 cache tile size
- Register-level blocking: important to hold array values in registers; MU and NU are the register tile sizes
- Software pipelining: unroll and schedule operations; Latency and xFetch are the scheduling parameters
- Versioning: dynamically decide which way to compute
- Back-end compiler optimizations: scalar optimizations, instruction scheduling

Cache-level blocking (tiling)
Tiling in ATLAS:
- Only square tiles (NB×NB×NB)
- The working set of a tile fits in L1
- Tiles are usually copied to contiguous storage
- Special "clean-up" code is generated for the boundaries
Mini-MMM:
  for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        C[i][j] += A[i][k] * B[k][j];
NB: optimization parameter

Register-level blocking
Micro-MMM:
- A: MU×1, B: 1×NU, C: MU×NU
- Needs MU·NU + MU + NU registers
- Unroll the loops by MU, NU, and KU
Mini-MMM with the micro-MMM inside:
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU) {
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k++) {   // body unrolled KU times
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A's and B's and add to C's
      }
      store C[i..i+MU-1, j..j+NU-1]
    }
- Special clean-up code is required if NB is not a multiple of MU, NU, and KU
- MU, NU, KU: optimization parameters

Scheduling
- If an FMA (fused multiply-add) instruction is present, use it; otherwise schedule the computation using Latency
- Schedule the memory operations using IFetch, NFetch, FFetch
- Latency, xFetch: optimization parameters
[Diagram: the MU·NU multiplies M1…M_MU·NU and adds A1…A_MU·NU are interleaved so that each add trails its multiply by Latency slots (shown for Latency = 2), and the MU+NU loads L1…L_MU+NU are spread across the computation as an initial block of IFetch loads followed by blocks of NFetch loads]

Search Strategy
- Multi-dimensional optimization problem: independent parameters NB, MU, NU, KU, …; dependent variable: MFlops
- The function from parameters to the variable is given implicitly, but it can be evaluated repeatedly
- One optimization strategy: orthogonal line search, i.e., optimize along one dimension at a time, using reference values for the parameters not yet optimized
- Not guaranteed to find the optimal point, but might come close
Find Best NB
- Search in the following range: 16 ≤ NB ≤ 80, NB² ≤ L1Size
- In this search, use simple estimates for the other parameters (e.g., KU); test each candidate for …