Berkeley COMPSCI C267 - Lecture 6: Distributed Memory Data Parallel Programming

This preview shows pages 1-3, 20-21, and 40-42 of 42 pages.


Contents:
• CS 267 Applications of Parallel Computers, Lecture 6: Distributed Memory (continued), Data Parallel Programming
• Recap of Last Lecture
• Outline
• Example: Sharks and Fish
• 2 Algorithms for Gravity: What are their costs?
• More Algorithms for Gravity
• CS 267 Applications of Parallel Computers, Lecture 11: Data Parallel Programming
• Slide 8
• Data Parallel Architectures
• Vector Machines
• Use of SIMD Model on Vector Machines
• Evolution of Data Parallelism
• Fortran90 Execution Model
• Example: gravitational fish
• Array Operations
• Array Sections
• Reduction Operators
• Conditional Operation
• Forall in HPF (Extends F90)
• Conditional (masked) intrinsics
• Subroutines
• Implicit Communication
• Global Communication
• Specialized Communication
• Example: nbody calculation
• HPF Data Distribution (layout) directives
• Alignment
• Layouts
• Example
• Independent
• Extra slides
• Recap: Historical Perspective
• Basics of a Parallel Language
• Where are things going
• Scan Operations
• CMF Homes and Layouts
• CMF Layout Example
• 2D Electromagnetics
• Stencil Calculations
• Blocking
• Load balancing in Wator worlds?
• Other Data Parallel Languages

CS 267 Applications of Parallel Computers
Lecture 6: Distributed Memory (continued), Data Parallel Programming
James Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
CS267 L6 Data Parallel Programming — Demmel, Sp 1999

Recap of Last Lecture
°Distributed memory machines
•Each processor has independent memory
•Connected by network
°Cost = #messages * α + #words_sent * β + #flops * f
°Distributed memory programming
•MPI: Send/Receive, Collective Communication
•Sharks and Fish under gravity as example

Outline
°Distributed Memory Programming (continued)
•Review gravity algorithms
•Look at Sharks and Fish code
°Data Parallel Programming

Example: Sharks and Fish
°N fish on P procs, N/P fish per processor
•At each time step, compute forces on fish and move them
°Need to compute gravitational interaction
•In the usual N^2 algorithm, every fish depends on every other fish
•Every fish needs to "visit" every processor, even if it "lives" on one
°What is the cost?

2 Algorithms for Gravity: What are their costs?

Algorithm 1:
    Copy local Fish array of length N/P to Tmp array
    for j = 1 to N
        for k = 1 to N/P, compute force from Tmp(k) on Fish(k)
        "Rotate" Tmp by 1:
            for k = 2 to N/P, Tmp(k) <= Tmp(k-1)
            recv(my_proc - 1, Tmp(1))
            send(my_proc + 1, Tmp(N/P))

Algorithm 2:
    Copy local Fish array of length N/P to Tmp array
    for j = 1 to P
        for k = 1 to N/P
            for m = 1 to N/P, compute force from Tmp(k) on Fish(m)
        "Rotate" Tmp by N/P:
            recv(my_proc - 1, Tmp(1:N/P))
            send(my_proc + 1, Tmp(1:N/P))

What could go wrong? (Be careful of overwriting Tmp.)

More Algorithms for Gravity
°Algorithm 3 (in the Sharks and Fish code)
•All processors send their Fish to Proc 0
•Proc 0 broadcasts all Fish to all processors
°Tree algorithms
•Barnes-Hut, Greengard-Rokhlin, Anderson
•O(N log N) instead of O(N^2)
•Parallelizable with cleverness
•"Just" an approximation, but as accurate as you like (often only a few digits are needed, so why pay for more?)
•Same idea works for other problems where the effect of distant objects becomes "smooth" or "compressible"
  - electrostatics, vorticity, ...
  - radiosity in graphics
  - anything satisfying a Poisson equation or something like it
•Will talk about it in detail later in the course

CS 267 Applications of Parallel Computers
Lecture 11: Data Parallel Programming
Kathy Yelick
http://www.cs.berkeley.edu/~dmartin/cs267/

Outline
°Break for MatMul results
°Quick evolution of data-parallel machines
°Fortran 90
°HPF extensions

Data Parallel Architectures
°Programming model
•Operations are performed on each element of a large (regular) data structure in a single step
•Arithmetic, global data transfer
°A processor is logically associated with each data element; general communication and cheap global synchronization
•Driven originally by simple O.D.E.s
[Diagram: a control processor driving a grid of processor-memory (P-M) pairs]

Vector Machines
°The Cray-1
•RISC machine with load/store vector architecture
•Compact vector instructions, with non-unit stride
•High-performance scalar (ECL, single phase, 80 MHz, freon cooled)
•Tightly integrated scalar and vector => low start-up cost
•Function unit chaining (C = Ax + B)
•Vector registers, pipelined function units, highly interleaved semiconductor (SRAM) memory

Use of SIMD Model on Vector Machines
[Diagram: $vlr virtual processors VP0..VP$vlr-1 mapped across 32 general-purpose vector registers (vr0..vr31, $vdw bits per element), 32 flag registers (vf0..vf31, 1 bit), and 16 control registers (vcr0..vcr15, 32 bits)]

Evolution of Data Parallelism
°Rigid control structure (SIMD in Flynn's taxonomy)
•SISD = uniprocessor, MIMD = multiprocessor
°Cost savings from a centralized instruction sequencer
°Simple, regular calculations usually have good locality
•Realizable on a shared-memory or message-passing machine with a decent compiler
•May still require fast global synchronization
°Programming model converges to SPMD
•Data parallel appears as a convention in many languages

Fortran90 Execution Model
•Sequential composition of parallel (or scalar) statements
•Parallel operations on arrays
•Arrays have rank (# of dimensions), shape (extents), and type (elements)
  - HPF adds layout
•Communication implicit in array operations
•Configuration independent

Example: gravitational fish

    integer, parameter :: nfish = 10000
    complex fishp(nfish), fishv(nfish), force(nfish), accel(nfish)
    real fishm(nfish)
    . . .
    do while (t < tfinal)
       t = t + dt
       fishp = fishp + dt*fishv          ! parallel assignment
       call compute_current(force,fishp)
       accel = force/fishm
       fishv = fishv + dt*accel
       ...
    enddo
    . . .
    subroutine compute_current(force,fishp)
       complex force(:),fishp(:)
       force = (3,0)*(fishp*(0,1))/(max(abs(fishp),0.01)) - fishp   ! pointwise parallel operators
    end
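The lecture's cost model (Cost = #messages * α + #words_sent * β + #flops * f) makes the trade-off between the two rotation algorithms concrete. The sketch below is not from the course code; it plugs illustrative, made-up machine parameters into that model, counting N messages of 1 word for Algorithm 1 versus P messages of N/P words for Algorithm 2, with the same N^2/P force evaluations per processor either way.

```python
# Per-processor cost under the slides' model: msgs*alpha + words*beta + flops*f.
# alpha (message start-up), beta (per-word), f (per-flop) are hypothetical values
# chosen only to show that fewer, larger messages win when alpha >> beta.
def cost(msgs, words, flops, alpha=1e-4, beta=1e-6, f=1e-8):
    return msgs * alpha + words * beta + flops * f

N, P = 10_000, 100
flops = (N // P) * N   # each processor: N/P local fish interacting with all N fish

alg1 = cost(msgs=N, words=N, flops=flops)   # rotate by 1:   N messages of 1 word
alg2 = cost(msgs=P, words=N, flops=flops)   # rotate by N/P: P messages of N/P words

print(alg1, alg2)
```

Both algorithms move the same N words around the ring; Algorithm 2 amortizes the start-up cost α over N/P-word blocks, so it wins whenever α dominates β.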
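The preview shows Algorithm 2 only as slide pseudocode, so here is a serial Python simulation of its ring rotation (my sketch, not the course's Sharks and Fish code): P simulated processors each hold a block of N/P fish, and after P-1 block rotations every processor has "seen" every fish, which is the invariant the all-pairs force computation relies on.

```python
# Serial simulation of Algorithm 2's ring rotation (hypothetical stand-in for
# send/recv on a real ring of P processors).
N, P = 12, 4
chunk = N // P
fish = [list(range(i * chunk, (i + 1) * chunk)) for i in range(P)]  # local Fish arrays
tmp = [c[:] for c in fish]      # copy local Fish into Tmp, as the slide does
seen = [set(c) for c in fish]   # which fish each processor has interacted with so far

for step in range(P - 1):
    # "Rotate" Tmp by N/P: processor i sends its block to i+1 and receives the
    # block from i-1. Building a fresh list models a separate receive buffer,
    # avoiding the overwriting pitfall the slide warns about.
    tmp = [tmp[(i - 1) % P] for i in range(P)]
    for i in range(P):
        seen[i].update(tmp[i])  # stands in for: compute force from Tmp(k) on Fish(m)

print([len(s) for s in seen])   # every processor has now seen all N fish
```

In an MPI version, each rotation step would typically be a single MPI_Sendrecv into a distinct receive buffer, which serves the same purpose as the fresh list here.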
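The Fortran 90 fish example's whole-array statements map almost one-for-one onto NumPy elementwise operations, which may help if F90 array syntax is unfamiliar. This translation is mine, not the course's: the time-stepping structure and the 0.01 floor in compute_current come from the slide, while dt, tfinal, and the initial positions are arbitrary placeholders.

```python
import numpy as np

nfish = 10_000
rng = np.random.default_rng(42)
fishp = rng.standard_normal(nfish) + 1j * rng.standard_normal(nfish)  # positions (complex plane)
fishv = np.zeros(nfish, dtype=complex)                                # velocities
fishm = np.ones(nfish)                                                # masses
dt, t, tfinal = 0.01, 0.0, 0.1

def compute_current(fishp):
    # Mirrors force = (3,0)*(fishp*(0,1))/(max(abs(fishp),0.01)) - fishp:
    # every operation is elementwise over the whole array, with no explicit loop.
    return 3.0 * (fishp * 1j) / np.maximum(np.abs(fishp), 0.01) - fishp

while t < tfinal:
    t = t + dt
    fishp = fishp + dt * fishv        # parallel assignment
    force = compute_current(fishp)
    accel = force / fishm             # pointwise parallel operators
    fishv = fishv + dt * accel
```

As the "Communication implicit in array operations" bullet suggests, a data-parallel compiler would distribute these arrays and insert any needed communication; NumPy executes the same elementwise semantics on one node.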

