Performance Analysis Tools

Karl Fuerlinger
[email protected]

Slides from David Skinner, Sameer Shende, Shirley Moore, Bernd Mohr, Felix Wolf, Hans Christian Hoppe and others.

CS267 - Performance Analysis Tools | Karl Fuerlinger

Outline

Motivation
  – Why do we care about performance?
Concepts and definitions
  – The performance analysis cycle
  – Instrumentation
  – Measurement: profiling vs. tracing
  – Analysis: manual vs. automated
Tools
  – PAPI: Access to hardware performance counters
  – ompP: Profiling of OpenMP applications
  – IPM: Profiling of MPI apps
  – Vampir: Trace visualization
  – KOJAK/Scalasca: Automated bottleneck detection of MPI/OpenMP applications
  – TAU: Toolset for profiling and tracing of MPI/OpenMP/Java/Python applications

Motivation

Performance analysis is important
  – Large investments in HPC systems
    • Procurement: ~$40 million
    • Operational costs: ~$5 million per year
    • Electricity: 1 MW-year costs ~$1 million
  – Goal: solve larger problems
  – Goal: solve problems faster

Concepts and Definitions

The typical performance optimization cycle

[Figure: optimization cycle. Code Development yields a functionally complete and correct program; after instrumentation it passes repeatedly through Measure, Analyze, and Modify / Tune; the result is a complete, correct and well-performing program for Usage / Production.]

Instrumentation

Instrumentation = adding measurement probes to the code to observe its execution
Can be done on several levels
Different techniques for different levels
Different overheads and levels of accuracy with each technique
No instrumentation: run in a simulator.
E.g., Valgrind

[Figure: levels at which instrumentation can be inserted, from user-level abstractions in the problem domain through source code, preprocessor, compiler, object code and libraries, linker, executable, OS, runtime image, and VM; running the instrumented program yields performance data.]

Instrumentation – Examples (1)

Source code instrumentation
  – User-added time measurement, etc. (e.g., printf(), gettimeofday())
  – Many tools expose mechanisms for source code instrumentation in addition to the automatic instrumentation facilities they offer
  – Instrument program phases:
    • initialization / main iteration loop / data post-processing
  – Pragma and preprocessor based
      #pragma pomp inst begin(foo)
      #pragma pomp inst end(foo)
  – Macro / function call based
      ELG_USER_START("name");
      ...
      ELG_USER_END("name");

Instrumentation – Examples (2)

Preprocessor instrumentation
  – Example: Instrumenting OpenMP constructs with Opari
  – Preprocessor operation
    [Figure: original source code → preprocessor → modified (instrumented) source code]
  – Example: Instrumentation of a parallel region
    [Figure: code listing in which instrumentation added by Opari surrounds the line "/* ORIGINAL CODE in parallel region */"]
This is used for OpenMP analysis in tools such as KOJAK/Scalasca/ompP

Instrumentation – Examples (3)

Compiler instrumentation
  – Many compilers can instrument functions automatically
  – GNU compiler flag: -finstrument-functions
  – The compiler inserts calls to hook functions on every function entry and exit; a tool can implement these hooks to capture the events
  – Not standardized across compilers, often undocumented flags, sometimes not available at all
  – GNU compiler example:
      void __cyg_profile_func_enter(void *this, void *callsite) {
          /* called on function entry */
      }
      void __cyg_profile_func_exit(void *this, void *callsite) {
          /* called just before returning from function */
      }
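
The user-added time measurement listed under source code instrumentation can be sketched as follows. This is a minimal illustration, not the mechanism of any tool named in the slides: `region_t`, `region_start`, `region_end`, `region_report`, `run_phases`, and `busy_work` are hypothetical names introduced here, and only `gettimeofday()` and `printf()` come from the slide text. The three instrumented phases mirror the initialization / main iteration loop / data post-processing split mentioned above.

```c
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper (not from the slides): wall-clock time in seconds. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical per-region record: call count and accumulated time. */
typedef struct {
    const char *name;
    long calls;
    double total;   /* seconds accumulated over all completed calls */
    double started; /* timestamp taken by the most recent region_start() */
} region_t;

/* Probe placed at the start of a measured region. */
static void region_start(region_t *r)
{
    r->started = wall_seconds();
}

/* Probe placed at the end of a measured region. */
static void region_end(region_t *r)
{
    r->total += wall_seconds() - r->started;
    r->calls += 1;
}

static void region_report(const region_t *r)
{
    printf("%-16s calls=%ld time=%.6f s\n", r->name, r->calls, r->total);
}

/* Stand-in workload so the probes have something to measure. */
static double busy_work(long n)
{
    volatile double sum = 0.0;
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* Instruments three program phases and prints one line per phase.
   Returns the number of main-loop iterations that were timed. */
static long run_phases(region_t *init, region_t *loop, region_t *post,
                       long iterations)
{
    region_start(init);
    busy_work(100000);
    region_end(init);

    for (long it = 0; it < iterations; it++) {
        region_start(loop);
        busy_work(100000);
        region_end(loop);
    }

    region_start(post);
    busy_work(100000);
    region_end(post);

    region_report(init);
    region_report(loop);
    region_report(post);
    return loop->calls;
}
```

Called from a `main` with three zero-initialized records (for example `{"main loop", 0, 0.0, 0.0}`), `run_phases` prints the call count and accumulated time of each phase. Because `gettimeofday()` reports wall-clock time that the system may adjust, `clock_gettime()` with `CLOCK_MONOTONIC` is a common substitute when intervals must never run backwards.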