1. Parallel Programming
Aaron Bloomfield
CS 415
Fall 2005

2. Why Parallel Programming?
• Predict weather
• Predict spread of SARS
• Predict path of hurricanes
• Predict oil slick propagation
• Model growth of bio-plankton/fisheries
• Structural simulations
• Predict path of forest fires
• Model formation of galaxies
• Simulate nuclear explosions

3. Code that can be parallelized
   do i = 1 to max,
      a[i] = b[i] + c[i] * d[i]
   end do

4. Parallel Computers
• Programming model types
  – Shared memory
  – Message passing

5. Distributed Memory Architecture
• Each processor has direct access only to its local memory
• Processors are connected via a high-speed interconnect
• Data structures must be distributed
• Data exchange is done via explicit processor-to-processor communication: send/receive messages
• Programming models
  – Widely used standard: MPI
  – Others: PVM, Express, P4, Chameleon, PARMACS, ...
[Diagram: processors P0 ... Pn, each with its own local memory, connected by a communication interconnect]

6. Message Passing Interface
MPI provides:
• Point-to-point communication
• Collective operations
  – Barrier synchronization
  – Gather/scatter operations
  – Broadcast, reductions
• Different communication modes
  – Synchronous/asynchronous
  – Blocking/non-blocking
  – Buffered/unbuffered
• Predefined and derived datatypes
• Virtual topologies
• Parallel I/O (MPI 2)
• C/C++ and Fortran bindings
• http://www.mpi-forum.org

7. Shared Memory Architecture
• Processors have direct access to global memory and I/O through a bus or fast switching network
• Cache coherency protocol guarantees consistency of memory and I/O accesses
• Each processor also has its own memory (cache)
• Data structures are shared in the global address space
• Concurrent access to shared memory must be coordinated
• Programming models
  – Multithreading (thread libraries)
  – OpenMP
[Diagram: processors P0 ... Pn, each with a cache, connected to global shared memory over a shared bus]

8. OpenMP
• OpenMP: portable shared memory parallelism
• Higher-level API for writing portable multithreaded applications
• Provides a set of compiler directives and library routines for parallel application programmers
• API bindings for Fortran, C, and C++
• http://www.OpenMP.org
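As a concrete sketch (not part of the original slides), the independent loop from slide 3 can be expressed with an OpenMP compiler directive in C; the array size and input values below are placeholder assumptions.

   #include <stdio.h>

   #define MAX 1000                  /* placeholder problem size */

   int main(void) {
       double a[MAX], b[MAX], c[MAX], d[MAX];

       /* arbitrary input values, just for illustration */
       for (int i = 0; i < MAX; i++) {
           b[i] = i;
           c[i] = 2.0;
           d[i] = 0.5;
       }

       /* Each iteration is independent, so a single directive lets the
          compiler split the iterations among threads. */
       #pragma omp parallel for
       for (int i = 0; i < MAX; i++) {
           a[i] = b[i] + c[i] * d[i];
       }

       printf("a[10] = %f\n", a[10]);
       return 0;
   }

Compiled with OpenMP enabled (e.g., gcc -fopenmp) the iterations are divided among threads; without the flag the pragma is simply ignored and the loop runs sequentially, which is part of OpenMP's portability appeal.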
10. Approaches
• Parallel algorithms
• Parallel languages
• Message passing (low-level)
• Parallelizing compilers

11. Parallel Languages
• CSP - Hoare's notation for parallelism as a network of sequential processes exchanging messages.
• Occam - Real language based on CSP. Used for the transputer, in Europe.

12. Fortran for parallelism
• Fortran 90 - Array language. Triplet notation for array sections. Operations and intrinsic functions possible on array sections.
• High Performance Fortran (HPF) - Similar to Fortran 90, but includes data layout specifications to help the compiler generate efficient code.

13. More parallel languages
• ZPL - Array-based language at UW. Compiles into C code (highly portable).
• C* - C extended for parallelism

14. Object-Oriented
• Concurrent Smalltalk
• Threads in Java, Ada, and thread libraries for use in C/C++
  – This uses a library of parallel routines

15. Functional
• NESL, Multilisp
• Id & Sisal (more dataflow)

16. Parallelizing Compilers
Automatically transform a sequential program into a parallel program.
1. Identify loops whose iterations can be executed in parallel.
2. Often done in stages.
Q: Which loops can be run in parallel?
Q: How should we distribute the work/data?

17. Data Dependences
Flow dependence - RAW (Read-After-Write). A "true" dependence: read a value after it has been written into a variable.
Anti-dependence - WAR (Write-After-Read). Write a new value into a variable after the old value has been read.
Output dependence - WAW (Write-After-Write). Write a new value into a variable and then later write another value into the same variable.

18. Example
1: A = 90;
2: B = A;
3: C = A + D;
4: A = 5;
(Statements 2 and 3 have flow dependences on statement 1; statement 4 has anti-dependences on statements 2 and 3 and an output dependence on statement 1.)

19. Dependencies
A parallelizing compiler must identify loops that do not have dependences BETWEEN ITERATIONS of the loop.
Example:
   do I = 1, 1000
      A(I) = B(I) + C(I)
      D(I) = A(I)
   end do

20. Example
Fork one thread for each processor. Each thread executes the loop:
   do I = my_lo, my_hi
      A(I) = B(I) + C(I)
      D(I) = A(I)
   end do
Wait for all threads to finish before proceeding.

21. Another Example
   do I = 1, 1000
      A(I) = B(I) + C(I)
      D(I) = A(I+1)
   end do
(Here iteration I reads A(I+1), which iteration I+1 writes, so there is a loop-carried anti-dependence and the iterations cannot simply be run in parallel.)

22. Yet Another Example
   do I = 1, 1000
      A( X(I) ) = B(I) + C(I)
      D(I) = A( X(I) )
   end do
(Whether these iterations are independent depends on the contents of the index array X, which the compiler generally cannot determine at compile time.)

23. Parallel Compilers
• Two concerns:
  – Parallelizing code: the compiler will move code around to uncover parallel operations
  – Data locality: if a parallel operation has to get data from another processor's memory, that's bad

24. Distributed computing
• Take a big task that has natural parallelism
• Split it up among many different computers across a network
• Examples: SETI@Home, prime number searches, Google Compute, etc.
• Distributed computing is a form of parallel computing
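As a closing sketch (again not part of the original slides), the dependence-free loop from slide 19 can also be parallelized by hand with MPI, the message-passing standard from slide 6; the block distribution, the array size, and the assumption that N divides evenly among the processes are all illustrative choices.

   #include <mpi.h>
   #include <stdio.h>

   #define N 1000   /* illustrative problem size; assumed divisible by the
                       number of MPI processes to keep the sketch short */

   int main(int argc, char **argv) {
       double A[N], B[N], C[N], D[N];
       int rank, size;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       /* block distribution: each process owns N/size consecutive iterations */
       int chunk = N / size;
       int lo = rank * chunk;
       int hi = lo + chunk;

       /* arbitrary local input values, just for illustration */
       for (int i = lo; i < hi; i++) {
           B[i] = i;
           C[i] = 1.0;
       }

       /* no communication is needed inside the loop because there are no
          dependences between iterations */
       double local_sum = 0.0;
       for (int i = lo; i < hi; i++) {
           A[i] = B[i] + C[i];
           D[i] = A[i];
           local_sum += D[i];
       }

       /* combine the per-process results with a collective reduction */
       double total = 0.0;
       MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

       if (rank == 0)
           printf("sum of D = %f\n", total);

       MPI_Finalize();
       return 0;
   }

Built and run with the usual MPI tooling (e.g., mpicc and mpirun -np 4), each process computes only its own block of iterations and a single collective reduction combines the results, mirroring the point-to-point-plus-collectives model summarized on slide 6.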