COSC 6374 Parallel Computation
Performance Oriented Software Design
Edgar Gabriel
Spring 2009

Amdahl's Law

Describes the performance gains obtained by enhancing one part of the overall system (code, computer):

$$\text{Speedup} = \frac{\text{Performance of entire task using the enhancement}}{\text{Performance of entire task not using the enhancement}}$$

or

$$\text{Speedup} = \frac{\text{Execution time of the task not using the enhancement}}{\text{Execution time of the task using the enhancement}}$$

Amdahl's Law (II)

Amdahl's Law depends on two factors:
- Fraction of the execution time affected by the enhancement
- The improvement gained by the enhancement for this fraction

Thus

$$\text{Execution time}_{enh} = \text{Execution time}_{org} \times \left( (1 - \text{Fraction}_{enh}) + \frac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}} \right)$$

$$\text{Speedup}_{overall} = \frac{\text{Execution time}_{org}}{\text{Execution time}_{enh}} = \frac{1}{(1 - \text{Fraction}_{enh}) + \frac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}}}$$

Amdahl's Law (III)

[Figure: Speedup total (0 to 6) plotted against Speedup enhanced (0 to 100), one curve each for Fraction enhanced = 20%, 40%, 60% and 80%]

Amdahl's Law (IV)

[Figure: "Speedup according to Amdahl's Law". Speedup total (0 to 12) plotted against Fraction enhanced (0 to 1), one curve each for Speedup enhanced = 2, 4 and 10]

Three big questions

- Which are the most time consuming code sections?
- How efficient are those routines?
- What are the reasons for inefficiency?

Which are the most time consuming code sections?

- Need to profile the application
- Standard tools in UNIX-like environments: gprof, valgrind
- Valgrind: collection of various tools to analyze an application at runtime
  - --tool=memcheck: memory debugger
  - --tool=cachegrind: estimate of the cache usage of an application
  - --tool=callgrind: provides a trace of the function calls
  - Most tools produce an output file, e.g. cachegrind.out.<procid>
  - kcachegrind: visualization tool for valgrind output files
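
The Amdahl's Law formula from the slides above can be evaluated directly. The following is a minimal sketch in C, not part of the lecture material; the function names amdahl_speedup and amdahl_limit are illustrative assumptions.

```c
#include <assert.h>

/* Overall speedup according to Amdahl's Law.
 * fraction_enh: fraction of the original execution time that benefits
 *               from the enhancement (0..1)
 * speedup_enh:  speedup of that fraction alone (> 0)
 * (hypothetical helper name, chosen for this sketch) */
double amdahl_speedup(double fraction_enh, double speedup_enh)
{
    assert(fraction_enh >= 0.0 && fraction_enh <= 1.0);
    assert(speedup_enh > 0.0);
    return 1.0 / ((1.0 - fraction_enh) + fraction_enh / speedup_enh);
}

/* Upper bound on the overall speedup when speedup_enh grows without
 * limit: only the non-enhanced fraction of the time remains. */
double amdahl_limit(double fraction_enh)
{
    assert(fraction_enh >= 0.0 && fraction_enh < 1.0);
    return 1.0 / (1.0 - fraction_enh);
}
```

With these definitions, amdahl_speedup(0.8, 10.0) evaluates to roughly 3.57, while amdahl_limit(0.8) is 5: even an arbitrarily fast enhancement of 80% of the runtime cannot give more than a fivefold overall speedup.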

How to determine the sources of overhead

- Get detailed data for different sections of the routine
- Get an estimate of the number of operations executed within these sections
- Scaling issues: for each of the n processes we might end up with
  - a large number of time stamps, e.g. k per process
  - a large number of measurements per time stamp, e.g. m per time stamp
    - execution time of MPI functions
    - various PAPI counters
    - user defined values
  - This leads to n * k * m data values for the performance analysis

Data reduction for performance analysis

- Data reduction for the number of processes analyzed
  - Find processors exposing the same behavior and focus the performance analysis on a single processor of each group
- Data reduction per process
  - Eliminate the measurements exposing the same information
- Data reduction in time
  - Find a small typical cycle in the application and ignore the rest
- Automatic statistical methods are inevitable: cluster analysis

Where do we lose efficiency?

    gabriel@sharj> valgrind --tool=cachegrind ATF
    ==27050==
    ==27050== I   refs:      7,477,574,763
    ==27050== I1  misses:            1,856
    ==27050== L2i misses:            1,774
    ==27050== I1  miss rate:          0.00%
    ==27050== L2i miss rate:          0.00%
    ==27050==
    ==27050== D   refs:      3,663,973,777  (3,517,790,756 rd + 146,183,021 wr)
    ==27050== D1  misses:       89,705,595  (   85,089,836 rd +   4,615,759 wr)
    ==27050== L2d misses:       85,614,772  (   81,648,115 rd +   3,966,657 wr)
    ==27050== D1  miss rate:           2.4% (          2.4%   +         3.1%  )
    ==27050== L2d miss rate:           2.3% (          2.3%   +         2.7%  )
    ==27050==
    ==27050== L2 refs:          89,707,451  (   85,091,692 rd +   4,615,759 wr)
    ==27050== L2 misses:        85,616,546  (   81,649,889 rd +   3,966,657 wr)
    ==27050== L2 miss rate:            0.7% (          0.7%   +         2.7%  )

PAPI hardware performance counters

- Modern processors expose a set of hardware counters that provide information about performance
- Limited number of counters
  - Number of simultaneous counters
    and the supported combinations of hardware counters depend on the processor
- Available on most modern operating systems
  - Linux: requires recompiling the kernel
  - Windows: works right away, however not very accurate due to some restrictions of the OS on context switches
- Requires modification of your source code to insert the PAPI calls

General Counters

    PAPI_FP_OPS     Floating point operations
    PAPI_TOT_CYC    Total cycles
    PAPI_HW_INT     Hardware interrupts

Slides based on a talk and courtesy of Andreas Knuepfer and Matthias Mueller, Center for Information Services and High Performance Computing, Technical University Dresden

Instruction Counters

    PAPI_TOT_IIS    Instructions issued
    PAPI_TOT_INS    Instructions completed
    PAPI_INT_INS    Integer instructions
    PAPI_LD_INS     Load instructions
    PAPI_SR_INS     Store instructions
    PAPI_BR_INS     Branch instructions
    PAPI_VEC_INS    Vector/SIMD instructions
    PAPI_LST_INS    Load/store instructions completed
    PAPI_SYC_INS    Synchronization instructions completed

FP Instruction Counters

    PAPI_FP_INS     Floating point instructions
    PAPI_FML_INS    Floating point multiply
    PAPI_FAD_INS    Floating point add
    PAPI_FDV_INS    Floating point divide
    PAPI_FSQ_INS    Floating point square root
    PAPI_FNV_INS    Floating point inverse

Cache Counters

    PAPI_L{1,2,3}_{D,I,T}C{M,H,A,R,W}
        Cache level 1, 2, 3
        D, I, T: data, instruction, total cache
        M, H, A, R, W: misses, hits, accesses, reads, writes
    PAPI_L{1,2,3}_{LD,ST}M
        Cache level 1, 2, 3
        LD, ST: load, store misses
    PAPI_PRF_DM
        Data prefetch cache misses
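
The naming pattern of the cache counters above can be spelled out in code. The following is a minimal sketch in C; cache_event_name is a hypothetical helper written for illustration and is not a PAPI function. It only assembles the name string. Whether a given preset is actually available on a particular processor still has to be checked on that machine, for example with PAPI_query_event.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a preset name of the form PAPI_L<level>_<type>C<kind>.
 * level: 1, 2 or 3
 * type:  'D' (data), 'I' (instruction) or 'T' (total cache)
 * kind:  'M' (misses), 'H' (hits), 'A' (accesses), 'R' (reads), 'W' (writes)
 * Returns 0 on success, -1 if an argument is outside the pattern or the
 * buffer is too small. (Hypothetical helper, not part of PAPI.) */
int cache_event_name(int level, char type, char kind, char *buf, size_t len)
{
    assert(buf != NULL);
    if (level < 1 || level > 3)
        return -1;
    /* strchr also matches the terminating '\0', so exclude it explicitly */
    if (type == '\0' || strchr("DIT", type) == NULL)
        return -1;
    if (kind == '\0' || strchr("MHARW", kind) == NULL)
        return -1;
    int n = snprintf(buf, len, "PAPI_L%d_%cC%c", level, type, kind);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

For level 1, type 'D' and kind 'M' the helper writes PAPI_L1_DCM into the buffer, the level 1 data cache miss counter in the scheme listed above.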

PAPI manual example

    PAPI_library_init(PAPI_VER_CURRENT);

    /* query and set up the right events to monitor */
    i = 0;
    if (PAPI_query_event(PAPI_FP_INS) == PAPI_OK)
        Events[i++] = PAPI_FP_INS;
    if (PAPI_query_event(PAPI_TOT_INS) == PAPI_OK)
        Events[i++] = PAPI_TOT_INS;

    PAPI_start_counters((int *)Events, NUM_EVENTS);

    /* Execute the real code */
    do_flops(NUM_FLOPS);

    PAPI_read_counters ...
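
Once the two counters of the example have been read, a derived metric is usually easier to interpret than the raw values. The following is a minimal sketch in C, assuming that both events were available and that the values array holds the PAPI_FP_INS count at index 0 and the PAPI_TOT_INS count at index 1. fp_instruction_share is a hypothetical helper, not part of PAPI.

```c
#include <assert.h>

/* Share of floating point instructions among all completed
 * instructions, in percent.
 * fp_ins:  value read for PAPI_FP_INS
 * tot_ins: value read for PAPI_TOT_INS (must be > 0)
 * (Hypothetical helper for illustration.) */
double fp_instruction_share(long long fp_ins, long long tot_ins)
{
    assert(fp_ins >= 0 && tot_ins > 0);
    return 100.0 * (double)fp_ins / (double)tot_ins;
}
```

A call such as fp_instruction_share(values[0], values[1]) then gives a rough indication of how much of the measured section consists of floating point work.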