DOC PREVIEW
Berkeley COMPSCI C267 - High Performance Programming on a Single Processor


High Performance Programming on a Single Processor:
Memory Hierarchies, Matrix Multiplication, Automatic Performance Tuning
James Demmel
[email protected]
www.cs.berkeley.edu/~demmel/cs267_Spr06
CS267 Lecture 2a, 01/19/06

Outline
•Idealized and actual costs in modern processors
•Memory hierarchies
•Case study: matrix multiplication
•Automatic performance tuning

Matrix-Multiply, Optimized Several Ways
[Figure: speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak = 330 MFlops.]

Performance Tuning
•Why can't compilers always choose the fastest code?
  –Accurate performance modeling at compile time is difficult (recall the performance complexity of even the simple memory benchmark)
  –There is a large set of possible code variations to consider, with different data structures and algorithms (perhaps with slightly different answers)
  –Everything needed to tune may not be known until run time (e.g., the structure of a sparse matrix)
  –Architectures change faster than compilers
•Consequences
  –The compiler can't duplicate the expertise of the user
  –You need to know the basics of writing efficient code
  –You need to recognize previously well-solved problems
  –Some important kernels (e.g., matmul) are automatically tuned
•Homework (will be on the web page)
  –Optimizing matrix multiply
  –Work in (assigned) interdisciplinary teams
  –Race against one another and against "best practice"

Search Over Block Sizes (e.g., in Matmul)
•Performance models are useful for high-level algorithms
  –They help in developing a blocked algorithm
•Models have not proven very useful for block-size selection
  –Too complicated to be useful (see work by Sid Chatterjee for a detailed model)
  –Too simple to be accurate (multiple multidimensional arrays, virtual memory, etc.)
  –Speed depends on matrix dimensions, details of the code, the compiler, and the processor
•Some systems instead search over a "design space" of possible implementations, measuring each one
  –PHiPAC: www.icsi.berkeley.edu/~bilmes/phipac (in particular tr-98-035.ps.gz)
  –ATLAS: www.netlib.org/atlas — part of Matlab and some vendor libraries

What Part of the Matmul Search Space Looks Like
[Figure: a 2-D slice of a 3-D register-tile search space.]
[Figure note: the dark blue region was pruned. Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler. Axes: number of rows and number of columns in the register block.]

ATLAS (DGEMM, n = 500)
•ATLAS is faster than all other portable BLAS implementations, and it is comparable with the machine-specific libraries provided by the vendor.
[Bar chart: MFLOPS across architectures for Vendor BLAS, ATLAS BLAS, and F77 BLAS. Source: Jack Dongarra.]

Tiling Alone Might Not Be Enough
•Naïve and "naïvely tiled" code on Itanium 2
•Searched all block sizes to find the best, b = 56
•Starting point for the next homework
[Plot: MFlop/s vs. matrix dimension for the three-loop and blocked (b = 56) versions.]

Optimizing in Practice
•Tiling for registers
  –Loop unrolling, use of named "register" variables
•Tiling for multiple levels of cache and TLB
•Exploiting fine-grained parallelism in the processor
  –Superscalar execution; pipelining
•Complicated compiler interactions
•Practice on the homework!
•See "Performance Optimization of Numerically Intensive Codes", Goedecker and Hoisie, SIAM 2001

Removing False Dependencies
•Using local variables, reorder operations to remove false dependencies. In the code below, the compiler must assume a and b may alias, so the write to a[i] creates a false read-after-write hazard with the read of b[i+1] and serializes the two statements:

    a[i] = b[i] + c;
    a[i+1] = b[i+1] * d;

•Pre-loading into locals breaks the hazard:

    float f1 = b[i];
    float f2 = b[i+1];
    a[i] = f1 + c;
    a[i+1] = f2 * d;

•With some compilers, you can instead declare a and b unaliased, via "restrict" pointers, a compiler flag, or a pragma.

Exploit Multiple Registers
•Reduce demands on memory bandwidth by pre-loading into local variables. The example is a convolution:

    while( … ) {
        *res++ = filter[0]*signal[0]
               + filter[1]*signal[1]
               + filter[2]*signal[2];
        signal++;
    }

•Pre-loading the filter taps keeps them in registers:

    float f0 = filter[0];
    float f1 = filter[1];
    float f2 = filter[2];
    while( … ) {
        *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
        signal++;
    }

•Also possible: register float f0 = …;

Loop Unrolling
•Expose instruction-level parallelism by unrolling three iterations and rotating the signal values through s0, s1, s2:

    float f0 = filter[0], f1 = filter[1], f2 = filter[2];
    float s0 = signal[0], s1 = signal[1], s2 = signal[2];
    *res++ = f0*s0 + f1*s1 + f2*s2;
    do {
        signal += 3;
        s0 = signal[0]; res[0] = f0*s1 + f1*s2 + f2*s0;
        s1 = signal[1]; res[1] = f0*s2 + f1*s0 + f2*s1;
        s2 = signal[2]; res[2] = f0*s0 + f1*s1 + f2*s2;
        res += 3;
    } while( … );

Expose Independent Operations
•Hide instruction latency
  –Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion
  –Balance the instruction mix (what functional units are available?)

    f1 = f5 * f9;
    f2 = f6 + f10;
    f3 = f7 * f11;
    f4 = f8 + f12;

Copy Optimization
•Copy input operands or blocks
  –Reduce cache conflicts
  –Constant array offsets for fixed-size blocks
  –Expose …

