Berkeley COMPSCI 258 - Berkeley Benchmarking and OPtimization

Unformatted text preview:

Automatic Performance Tuning of Numerical Kernels BeBOP: Berkeley Benchmarking and OPtimizationPerformance Tuning ParticipantsOutlineMotivation History Related WorkPerformance TuningConventional Performance TuningAutomatic Performance TuningSome Automatic Tuning ProjectsAn Example: Dense Matrix MultiplyTuning Techniques in PHIPACUntiled (“Naïve”) n x n Matrix MultiplyUntiled (“Naïve”) n x n Matrix Multiply - AnalysisTiled (Blocked) Matrix MultiplyTiled (Blocked) Matrix Multiply - AnalysisPHIPAC: Copy optimizationPHIPAC: Removing False DependenciesPHIPAC: Exploiting Multiple RegistersPHIPAC: Minimizing Pointer UpdatesPHIPAC: Loop UnrollingPHIPAC: Software Pipelining of LoopsPHIPAC: Exposing Independent OperationsTuning pays off – PHIPAC (Bilmes, Asanovic,Demmel)Tuning pays off – ATLAS (Dongarra, Whaley)Search for optimal register tile sizes on Sun Ultra 10Search for Optimal L0 block size in dense matmulHigh precision dense mat-vec multiply (XBLAS)High Precision Algorithms (XBLAS)Tuning pays off – FFTW (Frigo, Johnson)Tuning Sparse Matrix ComputationsMatrix #2 – raefsky3 (FEM/Fluids)Matrix #22 (cont’d)- bayer02 (chemical process simulation)Matrix #40 – gupta1 (linear programming)Matrix #40 (cont’d) – gupta1 (linear programming)Tuning Sparse matrix-vector multiplyHow Sparsity tunes y = A*xRegister-Blocked Performance of SPMV on Dense Matrices (up to 12x12)Which other sparse operations can we tune?Test MatricesResults on Sun Ultra 1/170Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 1 RHSSpeedups on SPMV from Sparsity on Sun Ultra 1/170 – 9 RHSSpeed up from Cache Blocking on LSI matrix on Sun UltraRecent Results on P4 using icc and gccSpeedup of SPMV from Sparsity on P4/icc-5.0.1Single vector speedups on P4 by matrix type – best r x cPerformance of SPMV from Sparsity on P4/icc-5.0.1Speed up from Cache Blocking on LSI matrix on P4Fill for SPMV from Sparsity on P4/icc-5.0.1Multiple vector speedups on P4Multiple vector speedups on P4 – by matrix typeMultiple Vector Performance on P4Symmetric Sparse Matrix-Vector Multiply on P4 (vs naïve full = 1)Sparse Triangular Solve (Matlab’s colmmd ordering) on P4AT*A on P4 (Accesses A only once)Preliminary Results on Itanium using eccSpeedup of SPMV from Sparsity on Itanium/ecc-5.0.1Single vector speedups on Itanium by matrix typeRaw Performance of SPMV from Sparsity on ItaniumFill for SPMV from Sparsity on ItaniumImprovements to register block size selectionMultiple vector speedups on ItaniumMultiple vector speedups on Itanium – by matrix typeMultiple Vector Performance on ItaniumSpeed up from Cache Blocking on LSI matrix on ItaniumFuture WorkBackground on Test MatricesSparse Matrix Benchmark Suite (1/3)Sparse Matrix Benchmark Suite (2/3)Sparse Matrix Benchmark Suite (3/3)Slide 70Matrix #2 (cont’d) – raefsky3 (FEM/Fluids)Matrix #22 - bayer02 (chemical process simulation)Slide 73Matrix #27 - pwt (structural engineering)Matrix #27 (cont’d)- pwt (structural engineering)Matrix #29 – wang4 (semiconductor device simulation)Matrix #29 (cont’d)-wang4 (seminconductor device sim.)Slide 78Slide 79Symmetric Matrix Benchmark SuiteLower Triangular Matrix Benchmark SuiteLower triangular factor: Matrix #2 – ex11Lower triangular factor: Matrix #3 - goodwinLower triangular factor: Matrix #4 – lhr10Automatic Performance TuningAutomatic Performance Tuningof Numerical Kernels of Numerical Kernels BeBOP: Berkeley Benchmarking and BeBOP: Berkeley Benchmarking and OPtimizationOPtimizationJames DemmelEECS and MathUC BerkeleyKatherine YelickEECSUC BerkeleySupport from NSF, DOE SciDAC, IntelPerformance Tuning ParticipantsPerformance Tuning ParticipantsOther FacultyMichael Jordan, William Kahan, Zhaojun Bai (UCD)ResearchersMark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)PhD StudentsRich Vuduc, Yozo Hida, Geoff PikeUndergradsBrian Gaeke , Jen Hsu, Shoaib Kamil, Suh Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon ThompsonOutlineOutlineMotivation, History, Related workTuning Dense Matrix OperationsTuning Sparse Matrix OperationsResults on Sun Ultra 1/170Recent results on P4Recent results on ItaniumFuture WorkMotivationMotivationHistoryHistoryRelated WorkRelated WorkPerformance TuningPerformance Tuning Motivation: performance of many applications dominated by a few kernelsCAD  Nonlinear ODEs  Nonlinear equations  Linear equations  Matrix multiplyMatrix-by-matrix or matrix-by-vectorDense or SparseInformation retrieval by LSI  Compress term-document matrix  …  Sparse mat-vec multiplyMany other examples (not all linear algebra)Conventional Performance TuningConventional Performance Tuning Motivation: performance of many applications dominated by a few kernels Vendor or user hand tunes kernels Drawbacks:Very time consuming and tedious workEven with intimate knowledge of architecture and compiler, performance hard to predictGrowing list of kernels to tuneExample: New BLAS (Basic Linear Algebra Subroutines) StandardMust be redone for every architecture, compilerCompiler technology often lags architectureNot just a compiler problem: Best algorithm may depend on input, so some tuning at run-time.Not all algorithms semantically or mathematically equivalentAutomatic Performance TuningAutomatic Performance TuningApproach: for each kernel1.Identify and generate a space of algorithms2.Search for the fastest one, by running themWhat is a space of algorithms?Depending on kernel and input, may varyinstruction mix and ordermemory access patternsdata structures mathematical formulation When do we search?Once per kernel and architecture At compile timeAt run timeAll of the aboveSome Automatic Tuning ProjectsSome Automatic Tuning ProjectsPHIPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes,Asanovic,Vuduc,Demmel)ATLAS (www.netlib.org/atlas) (Dongarra, Whaley; in Matlab)XBLAS (www.nersc.gov/~xiaoye/XBLAS) (Demmel, X. Li)Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im) FFTs and Signal ProcessingFFTW (www.fftw.org)Won 1999 Wilkinson Prize for Numerical SoftwareSPIRAL (www.ece.cmu.edu/~spiral)Extensions to other transforms, DSPsUHFFT Extensions to higher dimension, parallelismSpecial session at ICCS 2001Organized by Yelick and


View Full Document

Berkeley COMPSCI 258 - Berkeley Benchmarking and OPtimization

Documents in this Course
Load more
Download Berkeley Benchmarking and OPtimization
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Berkeley Benchmarking and OPtimization and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Berkeley Benchmarking and OPtimization 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?