Automatic Performance Tuning of Numerical Kernels BeBOP: Berkeley Benchmarking and OPtimizationPerformance Tuning ParticipantsOutlineMotivation History Related WorkPerformance TuningConventional Performance TuningAutomatic Performance TuningSome Automatic Tuning ProjectsAn Example: Dense Matrix MultiplyTuning Techniques in PHIPACUntiled (“Naïve”) n x n Matrix MultiplyUntiled (“Naïve”) n x n Matrix Multiply - AnalysisTiled (Blocked) Matrix MultiplyTiled (Blocked) Matrix Multiply - AnalysisPHIPAC: Copy optimizationPHIPAC: Removing False DependenciesPHIPAC: Exploiting Multiple RegistersPHIPAC: Minimizing Pointer UpdatesPHIPAC: Loop UnrollingPHIPAC: Software Pipelining of LoopsPHIPAC: Exposing Independent OperationsTuning pays off – PHIPAC (Bilmes, Asanovic,Demmel)Tuning pays off – ATLAS (Dongarra, Whaley)Search for optimal register tile sizes on Sun Ultra 10Search for Optimal L0 block size in dense matmulHigh precision dense mat-vec multiply (XBLAS)High Precision Algorithms (XBLAS)Tuning pays off – FFTW (Frigo, Johnson)Tuning Sparse Matrix ComputationsMatrix #2 – raefsky3 (FEM/Fluids)Matrix #22 (cont’d)- bayer02 (chemical process simulation)Matrix #40 – gupta1 (linear programming)Matrix #40 (cont’d) – gupta1 (linear programming)Tuning Sparse matrix-vector multiplyHow Sparsity tunes y = A*xRegister-Blocked Performance of SPMV on Dense Matrices (up to 12x12)Which other sparse operations can we tune?Test MatricesResults on Sun Ultra 1/170Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 1 RHSSpeedups on SPMV from Sparsity on Sun Ultra 1/170 – 9 RHSSpeed up from Cache Blocking on LSI matrix on Sun UltraRecent Results on P4 using icc and gccSpeedup of SPMV from Sparsity on P4/icc-5.0.1Single vector speedups on P4 by matrix type – best r x cPerformance of SPMV from Sparsity on P4/icc-5.0.1Speed up from Cache Blocking on LSI matrix on P4Fill for SPMV from Sparsity on P4/icc-5.0.1Multiple vector speedups on P4Multiple vector speedups on P4 – by matrix typeMultiple Vector Performance on P4Symmetric Sparse Matrix-Vector Multiply on P4 (vs naïve full = 1)Sparse Triangular Solve (Matlab’s colmmd ordering) on P4AT*A on P4 (Accesses A only once)Preliminary Results on Itanium using eccSpeedup of SPMV from Sparsity on Itanium/ecc-5.0.1Single vector speedups on Itanium by matrix typeRaw Performance of SPMV from Sparsity on ItaniumFill for SPMV from Sparsity on ItaniumImprovements to register block size selectionMultiple vector speedups on ItaniumMultiple vector speedups on Itanium – by matrix typeMultiple Vector Performance on ItaniumSpeed up from Cache Blocking on LSI matrix on ItaniumFuture WorkBackground on Test MatricesSparse Matrix Benchmark Suite (1/3)Sparse Matrix Benchmark Suite (2/3)Sparse Matrix Benchmark Suite (3/3)Slide 70Matrix #2 (cont’d) – raefsky3 (FEM/Fluids)Matrix #22 - bayer02 (chemical process simulation)Slide 73Matrix #27 - pwt (structural engineering)Matrix #27 (cont’d)- pwt (structural engineering)Matrix #29 – wang4 (semiconductor device simulation)Matrix #29 (cont’d)-wang4 (seminconductor device sim.)Slide 78Slide 79Symmetric Matrix Benchmark SuiteLower Triangular Matrix Benchmark SuiteLower triangular factor: Matrix #2 – ex11Lower triangular factor: Matrix #3 - goodwinLower triangular factor: Matrix #4 – lhr10Automatic Performance TuningAutomatic Performance Tuningof Numerical Kernels of Numerical Kernels BeBOP: Berkeley Benchmarking and BeBOP: Berkeley Benchmarking and OPtimizationOPtimizationJames DemmelEECS and MathUC BerkeleyKatherine YelickEECSUC BerkeleySupport from NSF, DOE SciDAC, IntelPerformance Tuning ParticipantsPerformance Tuning ParticipantsOther FacultyMichael Jordan, William Kahan, Zhaojun Bai (UCD)ResearchersMark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)PhD StudentsRich Vuduc, Yozo Hida, Geoff PikeUndergradsBrian Gaeke , Jen Hsu, Shoaib Kamil, Suh Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon ThompsonOutlineOutlineMotivation, History, Related workTuning Dense Matrix OperationsTuning Sparse Matrix OperationsResults on Sun Ultra 1/170Recent results on P4Recent results on ItaniumFuture WorkMotivationMotivationHistoryHistoryRelated WorkRelated WorkPerformance TuningPerformance Tuning Motivation: performance of many applications dominated by a few kernelsCAD Nonlinear ODEs Nonlinear equations Linear equations Matrix multiplyMatrix-by-matrix or matrix-by-vectorDense or SparseInformation retrieval by LSI Compress term-document matrix … Sparse mat-vec multiplyMany other examples (not all linear algebra)Conventional Performance TuningConventional Performance Tuning Motivation: performance of many applications dominated by a few kernels Vendor or user hand tunes kernels Drawbacks:Very time consuming and tedious workEven with intimate knowledge of architecture and compiler, performance hard to predictGrowing list of kernels to tuneExample: New BLAS (Basic Linear Algebra Subroutines) StandardMust be redone for every architecture, compilerCompiler technology often lags architectureNot just a compiler problem: Best algorithm may depend on input, so some tuning at run-time.Not all algorithms semantically or mathematically equivalentAutomatic Performance TuningAutomatic Performance TuningApproach: for each kernel1.Identify and generate a space of algorithms2.Search for the fastest one, by running themWhat is a space of algorithms?Depending on kernel and input, may varyinstruction mix and ordermemory access patternsdata structures mathematical formulation When do we search?Once per kernel and architecture At compile timeAt run timeAll of the aboveSome Automatic Tuning ProjectsSome Automatic Tuning ProjectsPHIPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes,Asanovic,Vuduc,Demmel)ATLAS (www.netlib.org/atlas) (Dongarra, Whaley; in Matlab)XBLAS (www.nersc.gov/~xiaoye/XBLAS) (Demmel, X. Li)Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im) FFTs and Signal ProcessingFFTW (www.fftw.org)Won 1999 Wilkinson Prize for Numerical SoftwareSPIRAL (www.ece.cmu.edu/~spiral)Extensions to other transforms, DSPsUHFFT Extensions to higher dimension, parallelismSpecial session at ICCS 2001Organized by Yelick and
View Full Document