Berkeley COMPSCI C267 - Parallel Application Scaling, Performance, and Efficiency

Parallel Application Scaling, Performance, and Efficiency
David Skinner, NERSC/LBL

Slide titles: Parallel Application Scaling, Performance, and Efficiency; Parallel Scaling of MPI Codes; Topics; Scale: Practical Importance; Let's jump to an example; Sharks and Fish II: Program; Sharks and Fish II: How fast?; Scaling: Good 1st Step: Do runtimes make sense?; Scaling: Walltimes; Scaling: terminology; Scaling: Speedups; Scaling: Efficiencies; Scaling: Analysis; Slide 14; Scaling: Superlinear Speedup; Strong Scaling: Communication Bound; Sharks and Atoms; Slide 18; Load Balance: Application Cartoon; Load Balance: performance data; Load Balance: ~code; Load Balance: real code; Load Balance: analysis; Load Balance: FFT; Load Balance: Summary; Slide 26; Scaling of MPI_Barrier(); Synchronization: terminology; Synchronization: MPI Functions; seaborg.nersc.gov; seaborg.nersc.gov basics; MPI on the IBM SP; MPI: seaborg.nersc.gov; Seaborg: point-to-point messaging; Inter-Node Bandwidth; MPI Performance is often Hierarchical; MPI: Latency not always 1 or 2 numbers; Synchronization: measurement; Synchronization: MPI Collectives; Synchronization: Architecture; Intrinsic Synchronization: Alltoall (two slides); This leads to variability in Execution Time; Synchronization: Summary; Slide 45; Simple Stuff; What's wrong here?; MPI_Barrier; Slide 49; Parallel File I/O: Strategies; Parallel File I/O: Metadata; Slide 52; Other sources of information; Dynamical Load Balance: Motivation; Slide 55

Parallel Scaling of MPI Codes
A practical talk on using MPI, with focus on:
• Distribution of work within a parallel program
• Placement of computation within a parallel computer
• Performance costs of different types of communication
• Understanding scaling performance terminology

Topics
• Application scaling
• Load balance
• Synchronization
• Simple stuff
• File I/O

Scale: Practical Importance
How long does it take to compute the NxN matrix product C = A*B? Assuming you can address 64 GB from one task, can you wait a month?
How do you balance the computational goal against the compute resources available? Choose the right scale!

Let's jump to an example
• Sharks and Fish II: N² parallel force evaluation
• e.g. 4 CPUs evaluate forces for 125 fish
• Domain decomposition: each CPU is "in charge" of ~31 fish (31 + 31 + 31 + 32), but keeps a fairly recent copy of all fish positions (replicated data)
• It is not possible to uniformly decompose problems in general, especially in many dimensions
• This toy problem is simple, has fine granularity, and is 2D
• Let's see how it scales

Sharks and Fish II: Program
Data: n_fish (global), my_fish (local), fish_i = {x, y, ...}
Dynamics: F = ma; V = Σ 1/r_ij; dq/dt = p/m; dp/dt = -dV/dq
Each step:
    MPI_Allgatherv(myfish_buf, len[rank], MPI_FishType, ...);
    for (i = 0; i < my_fish; ++i) {
        for (j = 0; j < n_fish; ++j) {    /* i != j */
            a_i += g * mass_j * (fish_i - fish_j) / r_ij;
        }
    }
    /* move fish */
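For concreteness, here is a minimal C sketch of the replicated-data step above (an illustration, not the course's fish code): each rank gathers every fish's position with MPI_Allgatherv and then runs the O(N²) force loop over only its own fish. The names (force_step, my_pos, all_pos, counts, displs, G) and the unit-mass, pure-gravity simplification are assumptions made for this sketch.

    #include <mpi.h>
    #include <math.h>

    /* One step of the replicated-data pattern: positions are stored as
     * (x, y) pairs of doubles; counts/displs are per-rank sizes in doubles. */
    void force_step(const double *my_pos, int my_count,
                    double *all_pos, int n_fish,
                    const int *counts, const int *displs,
                    double *ax, double *ay, double G)
    {
        /* Every rank ends up with a copy of all fish positions. */
        MPI_Allgatherv(my_pos, 2 * my_count, MPI_DOUBLE,
                       all_pos, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        for (int i = 0; i < my_count; ++i) {
            ax[i] = ay[i] = 0.0;
            for (int j = 0; j < n_fish; ++j) {
                double dx = all_pos[2*j]     - my_pos[2*i];
                double dy = all_pos[2*j + 1] - my_pos[2*i + 1];
                double r2 = dx*dx + dy*dy;
                if (r2 == 0.0) continue;                 /* skip i == j */
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                ax[i] += G * dx * inv_r3;                /* unit masses assumed */
                ay[i] += G * dy * inv_r3;
            }
        }
        /* ...then integrate velocities and positions ("move fish") and repeat. */
    }

The gather is cheap relative to the O(N²) compute when N is large, which is why the 1000-fish case below scales so much better than the 100-fish case.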
Sharks and Fish II: How fast?
• 100 fish can move 1000 steps in: 1 task → 5.459 s; 32 tasks → 2.756 s (a 1.98x speedup)
• 1000 fish can move 1000 steps in: 1 task → 511.14 s; 32 tasks → 20.815 s (a 24.6x speedup)
• So what's the "best" way to run?
  – How many fish do we really have?
  – How large a computer (time) do we have?
  – How quickly do we need the answer?

Scaling: Good 1st Step: Do runtimes make sense?
Running fish_sim for 100–1000 fish on 1–32 CPUs, we see time ~ fish². (Figure: runtimes from 1 task to 32 tasks.)

Scaling: Walltimes
Walltime is (all-)important, but let's look at some other scaling metrics. (Figure: each line is a contour describing the computations doable in a given time.)

Scaling: terminology
• Scaling studies involve changing the degree of parallelism. Will we be changing the problem also?
  – Strong scaling: fixed problem size
  – Weak scaling: problem size grows with additional compute resources
• How do we measure success in parallel scaling?
  – Speedup = Ts / Tp(n)
  – Efficiency = Ts / (n * Tp(n))
  Multiple definitions exist!

Scaling: Speedups
(Figure: measured speedups for the fish runs.)

Scaling: Efficiencies
Remarkably smooth! Often algorithm and architecture make the efficiency landscape quite complex. (Figure: measured efficiencies.)

Scaling: Analysis
Why does efficiency drop?
– Serial code sections → Amdahl's law
– Surface-to-volume ratio → communication bound
– Algorithm complexity or switching
– Communication protocol switching
– Scalability of the computer and interconnect → Whoa!

Scaling: Analysis
• In general, changing problem size and concurrency exposes or removes compute resources; bottlenecks shift.
• In general, the first bottleneck wins.
• Scaling brings additional resources too:
  – More CPUs (of course)
  – More cache(s)
  – More memory bandwidth in some cases

Scaling: Superlinear Speedup
(Figure: speedup vs. # CPUs (OMP).)

Strong Scaling: Communication Bound
64 tasks: 52% comm; 192 tasks: 66% comm; 768 tasks: 79% comm. The MPI_Allreduce buffer size is 32 bytes.
Q: What resource is being depleted here?
A: Small-message latency.
1) Compute per task is decreasing
2) Synchronization rate is increasing
3) Surface:volume ratio is increasing

Sharks and Atoms
At HPC centers like NERSC, fish are rarely modeled as point masses. The associated algorithms and their scalings are nonetheless of great practical importance for scientific problems. Example: Particle Mesh Ewald, a materials-science computation: 600 s at 256-way or 80 s at 1024-way.

Topics
• Load Balance
• Synchronization
• Simple stuff
• File I/O
Now, instead of looking at the scaling of specific applications, let's look at general issues in parallel application scalability.

Load Balance: Application Cartoon
(Figure: a "universal app" shown unbalanced vs. balanced; the difference is the time saved by load balance. Synchronization will be defined later.)

Load Balance: performance data
(Figure: MPI ranks sorted by total communication time; 64 tasks show ~200 s and 960 tasks show ~230 s of communication time.)

Load Balance: ~code
    while (1) {
        do_flops(Ni);      /* per-task work, not necessarily equal across ranks */
        MPI_Alltoall();
        MPI_Allreduce();
    }
(Run at 64x and 960x concurrency.)

Load Balance: real code
(Figure: per-rank time split into Sync, Flops, and Exchange, plotted against MPI rank.)

Load Balance: analysis
• The 64 slow tasks (those with more compute work) cause 30 seconds more "communication" in the 960-task run
• This leads to 28,800 CPU-seconds (8 CPU-hours) of unproductive computing
• All load imbalance requires is one slow task and a synchronizing collective!
• Pair problem size and concurrency well.
• Parallel computers allow you to waste time faster!

Load Balance: FFT
Q: When is imbalance good?
A: When it leads to a faster algorithm.

Load Balance: Summary
• Imbalance is most often a byproduct of data decomposition
• It must be addressed before further MPI tuning can happen
• Good software exists for graph partitioning / remeshing
• For regular grids, consider padding or contracting

Topics
• Load Balance
• Synchronization
• Simple stuff
• File I/O

Scaling of MPI_Barrier()
four orders of …
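As a hedged illustration of how one might measure the cost of MPI_Barrier() at different task counts, here is a generic microbenchmark sketch (not code from these slides); the iteration count of 1000 is an arbitrary choice:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, ntasks;
        const int iters = 1000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        MPI_Barrier(MPI_COMM_WORLD);               /* warm up and align tasks */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i)
            MPI_Barrier(MPI_COMM_WORLD);
        double per_call = (MPI_Wtime() - t0) / iters;

        if (rank == 0)
            printf("%d tasks: %.3e s per MPI_Barrier\n", ntasks, per_call);

        MPI_Finalize();
        return 0;
    }

Running this at several concurrencies (e.g. 16, 64, 256, 1024 tasks) shows how the per-call barrier cost grows with task count.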

