Berkeley COMPSCI C267 - Lecture 6: Distributed Memory Data Parallel Programming

This preview shows pages 1-3, 20-21, and 40-42 of 42 pages.


Contents:
• CS 267 Applications of Parallel Computers, Lecture 6: Distributed Memory (continued), Data Parallel Programming
• Recap of Last Lecture
• Outline
• Example: Sharks and Fish
• 2 Algorithms for Gravity: What are their costs?
• More Algorithms for Gravity
• CS 267 Applications of Parallel Computers, Lecture 11: Data Parallel Programming
• Slide 8
• Data Parallel Architectures
• Vector Machines
• Use of SIMD Model on Vector Machines
• Evolution of Data Parallelism
• Fortran90 Execution Model
• Example: gravitational fish
• Array Operations
• Array Sections
• Reduction Operators
• Conditional Operation
• Forall in HPF (Extends F90)
• Conditional (masked) intrinsics
• Subroutines
• Implicit Communication
• Global Communication
• Specialized Communication
• Example: nbody calculation
• HPF Data Distribution (layout) directives
• Alignment
• Layouts
• Example
• Independent
• Extra slides
• Recap: Historical Perspective
• Basics of a Parallel Language
• Where are things going
• Scan Operations
• CMF Homes and Layouts
• CMF Layout Example
• 2D Electromagnetics
• Stencil Calculations
• Blocking
• Load balancing in Wator worlds?
• Other Data Parallel Languages

CS 267 Applications of Parallel Computers
Lecture 6: Distributed Memory (continued), Data Parallel Programming
James Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
CS267 L6 Data Parallel Programming — Demmel, Sp 1999

Recap of Last Lecture
°Distributed memory machines
•Each processor has independent memory
•Connected by network
°Cost = #messages * α + #words_sent * β + #flops * f
°Distributed memory programming
•MPI: Send/Receive, Collective Communication
•Sharks and Fish under gravity as example

Outline
°Distributed Memory Programming (continued)
•Review gravity algorithms
•Look at Sharks and Fish code
°Data Parallel Programming

Example: Sharks and Fish
°N fish on P procs, N/P fish per processor
•At each time step, compute forces on fish and move them
°Need to compute gravitational interaction
•In the usual N^2 algorithm, every fish depends on every other fish
•Every fish needs to "visit" every processor, even if it "lives" on one
°What is the cost?

2 Algorithms for Gravity: What are their costs?

Algorithm 1:
    Copy local Fish array of length N/P to Tmp array
    for j = 1 to N
        for k = 1 to N/P, compute force from Tmp(k) on Fish(k)
        "Rotate" Tmp by 1:
            for k = 2 to N/P, Tmp(k) <= Tmp(k-1)
            recv(my_proc - 1, Tmp(1))
            send(my_proc + 1, Tmp(N/P))

Algorithm 2:
    Copy local Fish array of length N/P to Tmp array
    for j = 1 to P
        for k = 1 to N/P
            for m = 1 to N/P, compute force from Tmp(k) on Fish(m)
        "Rotate" Tmp by N/P:
            recv(my_proc - 1, Tmp(1:N/P))
            send(my_proc + 1, Tmp(1:N/P))

What could go wrong? (Be careful of overwriting Tmp.)

More Algorithms for Gravity
°Algorithm 3 (in the Sharks and Fish code)
•All processors send their Fish to Proc 0
•Proc 0 broadcasts all Fish to all processors
°Tree algorithms
•Barnes-Hut, Greengard-Rokhlin, Anderson
•O(N log N) instead of O(N^2)
•Parallelizable with cleverness
•"Just" an approximation, but as accurate as you like (often only a few digits are needed, so why pay for more?)
•Same idea works for other problems where the effect of distant objects becomes "smooth" or "compressible"
  - electrostatics, vorticity, ...
  - radiosity in graphics
  - anything satisfying a Poisson equation or something like it
•Will talk about it in detail later in the course

CS 267 Applications of Parallel Computers
Lecture 11: Data Parallel Programming
Kathy Yelick
http://www.cs.berkeley.edu/~dmartin/cs267/

Outline
°Break for MatMul results
°Quick evolution of data-parallel machines
°Fortran 90
°HPF extensions

Data Parallel Architectures
°Programming model
•Operations are performed on each element of a large (regular) data structure in a single step
•Arithmetic, global data transfer
°A processor is logically associated with each data element; general communication and cheap global synchronization
•Driven originally by simple O.D.E.s
[Diagram: a control processor driving a grid of processor-memory (P-M) pairs]

Vector Machines
°The Cray-1
•RISC machine with load/store vector architecture
•Compact vector instructions, with non-unit stride
•High-performance scalar (ECL, single phase, 80 MHz, freon cooled)
•Tightly integrated scalar and vector => low start-up cost
•Function unit chaining (C = Ax + B)
•Vector registers, pipelined function units, highly interleaved semiconductor (SRAM) memory

Use of SIMD Model on Vector Machines
[Diagram: $vlr virtual processors VP0..VP$vlr-1 mapped across 32 general-purpose vector registers (vr0..vr31, $vdw bits per element), 32 flag registers (vf0..vf31, 1 bit), and 16 control registers (vcr0..vcr15, 32 bits)]

Evolution of Data Parallelism
°Rigid control structure (SIMD in Flynn's taxonomy)
•SISD = uniprocessor, MIMD = multiprocessor
°Cost savings from a centralized instruction sequencer
°Simple, regular calculations usually have good locality
•Realizable on a shared-memory or message-passing machine with a decent compiler
•May still require fast global synchronization
°Programming model converges to SPMD
•Data parallel appears as a convention in many languages

Fortran90 Execution Model
•Sequential composition of parallel (or scalar) statements
•Parallel operations on arrays
•Arrays have rank (# of dimensions), shape (extents), and type (elements)
  - HPF adds layout
•Communication implicit in array operations
•Configuration independent

Example: gravitational fish

    integer, parameter :: nfish = 10000
    complex fishp(nfish), fishv(nfish), force(nfish), accel(nfish)
    real fishm(nfish)
    . . .
    do while (t < tfinal)
       t = t + dt
       fishp = fishp + dt*fishv          ! parallel assignment
       call compute_current(force,fishp)
       accel = force/fishm
       fishv = fishv + dt*accel
       ...
    enddo
    . . .
    subroutine compute_current(force,fishp)
       complex force(:),fishp(:)
       force = (3,0)*(fishp*(0,1))/(max(abs(fishp),0.01)) - fishp   ! pointwise parallel operators
    end
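The lecture's cost model (Cost = #messages * α + #words_sent * β + #flops * f) makes the trade-off between the two rotation algorithms concrete. The sketch below is not from the course code; it plugs illustrative, made-up machine parameters into that model, counting N messages of 1 word for Algorithm 1 versus P messages of N/P words for Algorithm 2, with the same N^2/P force evaluations per processor either way.

```python
# Per-processor cost under the slides' model: msgs*alpha + words*beta + flops*f.
# alpha (message start-up), beta (per-word), f (per-flop) are hypothetical values
# chosen only to show that fewer, larger messages win when alpha >> beta.
def cost(msgs, words, flops, alpha=1e-4, beta=1e-6, f=1e-8):
    return msgs * alpha + words * beta + flops * f

N, P = 10_000, 100
flops = (N // P) * N   # each processor: N/P local fish interacting with all N fish

alg1 = cost(msgs=N, words=N, flops=flops)   # rotate by 1:   N messages of 1 word
alg2 = cost(msgs=P, words=N, flops=flops)   # rotate by N/P: P messages of N/P words

print(alg1, alg2)
```

Both algorithms move the same N words around the ring; Algorithm 2 amortizes the start-up cost α over N/P-word blocks, so it wins whenever α dominates β.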
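The preview shows Algorithm 2 only as slide pseudocode, so here is a serial Python simulation of its ring rotation (my sketch, not the course's Sharks and Fish code): P simulated processors each hold a block of N/P fish, and after P-1 block rotations every processor has "seen" every fish, which is the invariant the all-pairs force computation relies on.

```python
# Serial simulation of Algorithm 2's ring rotation (hypothetical stand-in for
# send/recv on a real ring of P processors).
N, P = 12, 4
chunk = N // P
fish = [list(range(i * chunk, (i + 1) * chunk)) for i in range(P)]  # local Fish arrays
tmp = [c[:] for c in fish]      # copy local Fish into Tmp, as the slide does
seen = [set(c) for c in fish]   # which fish each processor has interacted with so far

for step in range(P - 1):
    # "Rotate" Tmp by N/P: processor i sends its block to i+1 and receives the
    # block from i-1. Building a fresh list models a separate receive buffer,
    # avoiding the overwriting pitfall the slide warns about.
    tmp = [tmp[(i - 1) % P] for i in range(P)]
    for i in range(P):
        seen[i].update(tmp[i])  # stands in for: compute force from Tmp(k) on Fish(m)

print([len(s) for s in seen])   # every processor has now seen all N fish
```

In an MPI version, each rotation step would typically be a single MPI_Sendrecv into a distinct receive buffer, which serves the same purpose as the fresh list here.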
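The Fortran 90 fish example's whole-array statements map almost one-for-one onto NumPy elementwise operations, which may help if F90 array syntax is unfamiliar. This translation is mine, not the course's: the time-stepping structure and the 0.01 floor in compute_current come from the slide, while dt, tfinal, and the initial positions are arbitrary placeholders.

```python
import numpy as np

nfish = 10_000
rng = np.random.default_rng(42)
fishp = rng.standard_normal(nfish) + 1j * rng.standard_normal(nfish)  # positions (complex plane)
fishv = np.zeros(nfish, dtype=complex)                                # velocities
fishm = np.ones(nfish)                                                # masses
dt, t, tfinal = 0.01, 0.0, 0.1

def compute_current(fishp):
    # Mirrors force = (3,0)*(fishp*(0,1))/(max(abs(fishp),0.01)) - fishp:
    # every operation is elementwise over the whole array, with no explicit loop.
    return 3.0 * (fishp * 1j) / np.maximum(np.abs(fishp), 0.01) - fishp

while t < tfinal:
    t = t + dt
    fishp = fishp + dt * fishv        # parallel assignment
    force = compute_current(fishp)
    accel = force / fishm             # pointwise parallel operators
    fishv = fishv + dt * accel
```

As the "Communication implicit in array operations" bullet suggests, a data-parallel compiler would distribute these arrays and insert any needed communication; NumPy executes the same elementwise semantics on one node.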

