Optimizing MMM & ATLAS Library Generator

Outline:
- Recall: MMM miss ratios
- IJK version (large cache)
- IJK version (small cache)
- MMM experiments
- How large can matrices be and still not suffer capacity misses?
- Check with experiment
- High-level picture of high-performance MMM code
- ATLAS
- Our approach
- BLAS
- Optimizations
- Cache-level blocking (tiling)
- Register-level blocking
- Scheduling
- Search Strategy
- Find Best NB
- Model-based optimization
- Modeling for Optimization Parameters
- Largest NB for no capacity/conflict misses
- Largest NB for no capacity misses
- Summary: Modeling for Tile Size (NB)
- Summary of model
- Experiments
- Some sensitivity graphs for Alpha 21264
- Eliminating performance gaps
- Small performance gap: Alpha 21264
- Large performance gap: Itanium
- Itanium diagnosis and solution
- Large performance gap: Opteron
- Opteron diagnosis and solution
- Refined model
- Bottom line
- Future Directions

Recall: MMM miss ratios
[Figure: L1 cache miss ratio for Intel Pentium III; MMM with N = 1…1300; 16KB cache, 32B blocks, 4-way associative, 8-byte elements]

IJK version (large cache)
  DO I = 1, N   // row-major storage
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
Large cache scenario:
- Matrices are small enough to fit into cache
- Only cold misses, no capacity misses
- Miss ratio: the data size is 3N² elements, each miss brings in b floating-point numbers, and the loop nest performs 4N³ memory accesses
- Miss ratio = (3N²/b) / (4N³) = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)

IJK version (small cache)
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
Small cache scenario:
- Matrices are large compared to the cache; the reuse distance is not O(1), so reuses miss
- Cold and capacity misses
- Miss counts: C: N²/b misses (good temporal locality); A: N³/b misses (good spatial locality); B: N³ misses (poor temporal and spatial locality)
- Miss ratio ≈ (N³/b + N³) / (4N³) = 0.25(b+1)/b = 0.3125 (for b = 4)

MMM experiments
[Figure: measured L1 cache miss ratio for Intel Pentium III; MMM with N = 1…1300; 16KB cache, 32B blocks, 4-way associative, 8-byte elements]
Can we predict this?
How large can matrices be and still not suffer capacity misses?
  DO I = 1, M
    DO J = 1, N
      DO K = 1, P
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
- Each iteration of the outermost loop walks over the entire B matrix, so all of B (N·P elements) must fit in cache
- We walk over the rows of A, and successive iterations of the middle loop touch the same row of A, so one row of A (P elements) must fit in cache
- We walk over the elements of C one at a time
- So the inequality is N·P + P + 1 ≤ C, where C is the cache capacity in elements

Check with experiment
- For our machine, the capacity of the L1 cache is 16KB / 8 bytes per double = 2^11 = 2048 doubles
- If the matrices are square, we must solve N² + N + 1 = 2^11, which gives N ≈ 45
- This agrees well with the experiment

High-level picture of high-performance MMM code
- Block the code for each level of the memory hierarchy: registers, L1 cache, …
- Choose the block sizes at each level using the theory described previously
- Useful optimization: choose the block size at level L+1 to be a multiple of the block size at level L

ATLAS
- Library generator for MMM and the other BLAS routines
- Blocks only for registers and the L1 cache
- Uses search to determine block sizes, rather than the analytical formulas we used; the search takes more time, but it is done once, when the library is produced
- Let us study the structure of ATLAS in a little more detail

Our approach
[Diagram: the original ATLAS infrastructure detects hardware parameters (NR, MulAdd, L*, L1Size) and feeds them to the ATLAS search engine (MMSearch), which drives the ATLAS MM code generator (MMCase) with candidate parameters (NB, MU, NU, KU, xFetch, MulAdd, Latency); each generated mini-MMM source is compiled, executed, and measured (MFLOPS), and the measurements steer the search. The model-based ATLAS infrastructure replaces the search engine with a model: the detected hardware parameters (NR, MulAdd, L*, L1Size, L1 I-cache size) go into the model, which computes the code-generator parameters directly.]

BLAS
Let us focus on MMM:
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < K; k++)
        C[i][j] += A[i][k]*B[k][j];
Properties:
- Very good reuse: O(N²) data, O(N³) computation
- Many optimization opportunities, few "real" dependences
- Will run poorly on modern machines: poor use of cache and registers, poor use of processor pipelines
Optimizations
- Cache-level blocking (tiling): ATLAS blocks only for the L1 cache; NB is the L1 cache tile size
- Register-level blocking: important to hold array values in registers; MU and NU are the register tile sizes
- Software pipelining: unroll and schedule operations; Latency and xFetch are the scheduling parameters
- Versioning: dynamically decide which way to compute
- Back-end compiler optimizations: scalar optimizations, instruction scheduling

Cache-level blocking (tiling)
Tiling in ATLAS:
- Only square tiles (NB×NB×NB)
- The working set of a tile fits in L1
- Tiles are usually copied to contiguous storage
- Special "clean-up" code is generated for the boundaries
Mini-MMM:
  for (int j = 0; j < NB; j++)
    for (int i = 0; i < NB; i++)
      for (int k = 0; k < NB; k++)
        C[i][j] += A[i][k] * B[k][j];
NB: optimization parameter

Register-level blocking
Micro-MMM:
- A: MU×1, B: 1×NU, C: MU×NU
- Needs MU·NU + MU + NU registers
- Unroll the loops by MU, NU, and KU
Mini-MMM with the micro-MMM inside:
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU) {
      load C[i..i+MU-1, j..j+NU-1] into registers
      for (int k = 0; k < NB; k++) {   // body unrolled KU times
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A's and B's and add to C's
      }
      store C[i..i+MU-1, j..j+NU-1]
    }
- Special clean-up code is required if NB is not a multiple of MU, NU, and KU
- MU, NU, KU: optimization parameters

Scheduling
- If an FMA (fused multiply-add) instruction is present, use it; otherwise schedule the computation using Latency
- Schedule the memory operations using IFetch, NFetch, FFetch
- Latency, xFetch: optimization parameters
[Diagram: the MU·NU multiplies M1…M_MU·NU and adds A1…A_MU·NU are interleaved so that each add trails its multiply by Latency slots (shown for Latency = 2), and the MU+NU loads L1…L_MU+NU are spread across the computation as an initial block of IFetch loads followed by blocks of NFetch loads]

Search Strategy
- Multi-dimensional optimization problem: independent parameters NB, MU, NU, KU, …; dependent variable: MFlops
- The function from parameters to the variable is given implicitly, but it can be evaluated repeatedly
- One optimization strategy: orthogonal line search, i.e., optimize along one dimension at a time, using reference values for the parameters not yet optimized
- Not guaranteed to find the optimal point, but might come close
Find Best NB
- Search in the following range: 16 ≤ NB ≤ 80, NB² ≤ L1Size
- In this search, use simple estimates for the other parameters (e.g., KU); test each candidate for …