Contents

Parallel Application Scaling, Performance, and Efficiency
Parallel Scaling of MPI Codes
Topics
Scale: Practical Importance
Let's jump to an example
Sharks and Fish II: Program
Sharks and Fish II: How fast?
Scaling: Good 1st Step: Do runtimes make sense?
Scaling: Walltimes
Scaling: terminology
Scaling: Speedups
Scaling: Efficiencies
Scaling: Analysis
Scaling: Superlinear Speedup
Strong Scaling: Communication Bound
Sharks and Atoms
Load Balance: Application Cartoon
Load Balance: performance data
Load Balance: ~code
Load Balance: real code
Load Balance: analysis
Load Balance: FFT
Load Balance: Summary
Scaling of MPI_Barrier()
Synchronization: terminology
Synchronization: MPI Functions
seaborg.nersc.gov
seaborg.nersc.gov basics
MPI on the IBM SP
MPI: seaborg.nersc.gov
Seaborg: point-to-point messaging
Inter-Node Bandwidth
MPI Performance is often Hierarchical
MPI: Latency not always 1 or 2 numbers
Synchronization: measurement
Synchronization: MPI Collectives
Synchronization: Architecture
Intrinsic Synchronization: Alltoall
This leads to variability in Execution Time
Synchronization: Summary
Simple Stuff
What's wrong here?
MPI_Barrier
Parallel File I/O: Strategies
Parallel File I/O: Metadata
Other sources of information
Dynamical Load Balance: Motivation


Parallel Application Scaling, Performance, and Efficiency
David Skinner, NERSC/LBL

Parallel Scaling of MPI Codes

A practical talk on using MPI, with a focus on:
•Distribution of work within a parallel program
•Placement of computation within a parallel computer
•Performance costs of different types of communication
•Understanding scaling performance terminology

Topics
•Application Scaling
•Load Balance
•Synchronization
•Simple stuff
•File I/O

Scale: Practical Importance
Time required to compute the NxN matrix product C = A*B. Assuming you can address 64 GB from one task, can you wait a month? How do you balance the computational goal vs. the compute resources? Choose the right scale!

Let's jump to an example
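To see why the matrix-product question matters: the classical dense multiply costs about 2N³ flops, so serial walltime can be estimated from an assumed sustained flop rate. A minimal sketch (the 10 GF/s rate and the function name are illustrative assumptions, not figures from the talk):

```python
# Rough serial-time estimate for C = A*B with dense NxN matrices.
# The classical algorithm performs ~2*N^3 floating-point operations.
# flops_per_sec is an assumed, illustrative sustained rate (10 GF/s).
def matmul_seconds(n, flops_per_sec=10e9):
    return 2.0 * n**3 / flops_per_sec

# At N = 100,000: 2e15 flops / 1e10 flop/s = 200,000 s (over two days),
# which is why choosing the right scale matters.
print(matmul_seconds(100_000))
```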
•Sharks and Fish II: N² parallel force evaluation
•e.g. 4 CPUs evaluate forces for 125 fish
•Domain decomposition: each CPU is "in charge" of ~31 fish (31 + 31 + 31 + 32), but keeps a fairly recent copy of all the fishes' positions (replicated data)
•It is not possible to uniformly decompose problems in general, especially in many dimensions
•This toy problem is simple, has fine granularity, and is 2D
•Let's see how it scales

Sharks and Fish II: Program

Data:
    n_fish   global
    my_fish  local
    fishi = {x, y, …}

Dynamics:
    F = ma
    V = Σ 1/rij
    dq/dt = p/m
    dp/dt = -dV/dq

    MPI_Allgatherv(myfish_buf, len[rank], MPI_FishType, …);
    for (i = 0; i < my_fish; ++i) {
        for (j = 0; j < n_fish; ++j) {   /* i != j */
            ai += g * massj * (fishi - fishj) / rij;
        }
    }
    Move fish

Sharks and Fish II: How fast?

•100 fish can move 1000 steps in:
    1 task    5.459 s
    32 tasks  2.756 s   (1.98x speedup)
•1000 fish can move 1000 steps in:
    1 task    511.14 s
    32 tasks  20.815 s  (24.6x speedup)
•So what's the "best" way to run?
    –How many fish do we really have?
    –How large a computer (time) do we have?
    –How quickly do we need the answer?

Scaling: Good 1st Step: Do runtimes make sense?
Running fish_sim for 100–1000 fish on 1–32 CPUs (1 task vs. 32 tasks), we see time ~ fish².

Scaling: Walltimes
Walltime is (all-)important, but let's look at some other scaling metrics. Each line is a contour describing the computations doable in a given time.

Scaling: terminology
•Scaling studies involve changing the degree of parallelism. Will we be changing the problem also?
    –Strong scaling: fixed problem size
    –Weak scaling: problem size grows with additional compute resources
•How do we measure success in parallel scaling?
    –Speedup = Ts / Tp(n)
    –Efficiency = Ts / (n * Tp(n))
Multiple definitions exist!

Scaling: Speedups

Scaling: Efficiencies
Remarkably smooth!
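Plugging the fish timings above into the Speedup = Ts/Tp(n) and Efficiency = Ts/(n*Tp(n)) definitions reproduces the 24.6x figure quoted earlier. A short sketch (the helper names are my own, not from the talk):

```python
def speedup(t_serial, t_parallel):
    # Speedup = Ts / Tp(n)
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_tasks):
    # Efficiency = Ts / (n * Tp(n))
    return t_serial / (n_tasks * t_parallel)

# 1000 fish, 1000 steps: 511.14 s on 1 task vs. 20.815 s on 32 tasks
print(round(speedup(511.14, 20.815), 1))         # -> 24.6, as on the slide
print(round(efficiency(511.14, 20.815, 32), 2))  # -> 0.77
```

Note that the 32-task efficiency for 1000 fish (~0.77) is far better than for 100 fish (1.98/32 ≈ 0.06): with too little work per task, communication dominates.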
Often algorithm and architecture make the efficiency landscape quite complex.

Scaling: Analysis
Why does efficiency drop?
–Serial code sections (Amdahl's law)
–Surface-to-volume effects (communication bound)
–Algorithm complexity or switching
–Communication protocol switching
–Scalability of the computer and interconnect (Whoa!)

Scaling: Analysis
•In general, changing problem size and concurrency exposes or removes compute resources. Bottlenecks shift.
•In general, the first bottleneck wins.
•Scaling brings additional resources too:
    –More CPUs (of course)
    –More cache(s)
    –More memory BW in some cases

Scaling: Superlinear Speedup
(speedup vs. # CPUs (OMP))

Strong Scaling: Communication Bound
64 tasks: 52% comm; 192 tasks: 66% comm; 768 tasks: 79% comm. The MPI_Allreduce buffer size is 32 bytes.
Q: What resource is being depleted here?
A: Small-message latency.
1) Compute per task is decreasing
2) Synchronization rate is increasing
3) Surface:volume ratio is increasing

Sharks and Atoms
At HPC centers like NERSC, fish are rarely modeled as point masses. The associated algorithms and their scalings are nonetheless of great practical importance for scientific problems.
Particle Mesh Ewald, MatSci computation: 600 s at 256-way or 80 s at 1024-way.

Topics
•Load Balance
•Synchronization
•Simple stuff
•File I/O
Now, instead of looking at the scaling of specific applications, let's look at general issues in parallel application scalability.

Load Balance: Application Cartoon
Universal App, unbalanced vs. balanced: time saved by load balance. (Synchronization will be defined later.)

Load Balance: performance data
MPI ranks sorted by total communication time: 64 tasks show 200 s, 960 tasks show 230 s.

Load Balance: ~code

    while (1) {
        do_flops(Ni);
        MPI_Alltoall();
        MPI_Allreduce();
    }

(run at 64x and 960x)

Load Balance: real code
(Sync, Flops, Exchange: time vs. MPI rank)

Load Balance: analysis
•The 64 slow tasks (with more compute work) cause 30 seconds more "communication" in 960 tasks.
•This leads to 28800 CPU*seconds (8 CPU*hours) of unproductive computing.
•All load imbalance requires is one slow task and a
synchronizing collective!
•Pair problem size and concurrency well.
•Parallel computers allow you to waste time faster!

Load Balance: FFT
Q: When is imbalance good?
A: When it leads to a faster algorithm.

Load Balance: Summary
•Imbalance is most often a byproduct of data decomposition
•It must be addressed before further MPI tuning can happen
•Good software exists for graph partitioning / remeshing
•For regular grids, consider padding or contracting

Topics
•Load Balance
•Synchronization
•Simple stuff
•File I/O

Scaling of MPI_Barrier()
four orders of
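The 8 CPU-hour figure from the load-balance analysis follows directly from one slow rank holding every other rank at a synchronizing collective. A sketch of that arithmetic (the function name is my own):

```python
def imbalance_waste_cpu_seconds(n_tasks, delay_s):
    # At a synchronizing collective (e.g. MPI_Allreduce), a delay of
    # delay_s in the slowest task idles all n_tasks ranks, so the
    # wasted work scales with the whole job, not the one slow task.
    return n_tasks * delay_s

waste = imbalance_waste_cpu_seconds(960, 30)
print(waste)          # -> 28800 CPU-seconds
print(waste / 3600)   # -> 8.0 CPU-hours, as in the analysis
```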