CS 267 Applications of Parallel Computers
Lecture 3: Introduction to Parallel Architectures and Programming Models
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
(CS267 L3 Programming Models, Demmel, Sp 1999)

Recap of Last Lecture
° The actual performance of a simple program can be a complicated function of the architecture
° Slight changes in the architecture or program may change the performance significantly
° Since we want to write fast programs, we must take the architecture into account, even on uniprocessors
° Since the actual performance is so complicated, we need simple models to help us design efficient algorithms
° We illustrated with a common technique for improving cache performance, called blocking, applied to matrix multiplication
• Blocking works for many architectures, but choosing the blocksize depends on the architecture

Outline
° Parallel machines and programming models
° Steps in writing a parallel program
° Cost modeling and performance trade-offs

Parallel Machines and Programming Models

A generic parallel architecture
[Figure: processors P connected to memories M through an interconnection network]
° Where does the memory go?

Parallel Programming Models
° Control
• how is parallelism created?
• what orderings exist between operations?
• how do different threads of control synchronize?
° Naming
• what data is private vs. shared?
• how is logically shared data accessed or communicated?
° Set of operations
• what are the basic operations?
• what operations are considered to be atomic?
° Cost
• how do we account for the cost of each of the above?

Trivial Example: computing the sum  f(A[0]) + f(A[1]) + ... + f(A[n-1])
° Parallel Decomposition:
• each evaluation and each partial sum is a task
° Assign n/p numbers to each of p procs
• each computes independent "private" results and a partial sum
• one (or all) collects the p partial sums and computes the global sum
=> Classes of Data
° Logically Shared
• the original n numbers, the global sum
° Logically Private
• the individual function evaluations
• what about the individual partial sums?

Programming Model 1
° Shared Address Space
• program consists of a collection of threads of control,
• each with a set of private variables
- e.g., local variables on the stack
• collectively with a set of shared variables
- e.g., static variables, shared common blocks, global heap
• threads communicate implicitly by writing and reading shared variables
• threads coordinate explicitly by synchronization operations on shared variables
- writing and reading flags
- locks, semaphores
° Like concurrent programming on a uniprocessor
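As a concrete illustration of Model 1, here is a minimal Python-threads sketch of the trivial sum example. The function f(x) = x*x and the half/half split of the array are illustrative choices, not from the slides; the lock makes each thread's read-modify-write of the shared s atomic:

```python
import threading

A = list(range(8))         # the logically shared input array
s = 0                      # logically shared global sum
s_lock = threading.Lock()  # protects updates to s

def f(x):
    return x * x           # illustrative function to sum (an assumption)

def worker(lo, hi):
    global s
    local_s = 0            # logically private partial sum
    for i in range(lo, hi):
        local_s += f(A[i])
    with s_lock:           # mutual exclusion: load/add/store of s run without interleaving
        s += local_s

n = len(A)
t1 = threading.Thread(target=worker, args=(0, n // 2))
t2 = threading.Thread(target=worker, args=(n // 2, n))
t1.start(); t2.start()
t1.join(); t2.join()
print(s)  # 0+1+4+9+16+25+36+49 = 140
```

Without s_lock, the two read-modify-write updates of s could interleave and one partial sum could be lost.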
[Figure: threads with private stacks (x = ..., y = ..x ...) reading and writing a shared region containing A]

Machine Model 1
° A shared memory machine
° Processors all connected to a large shared memory, each through its own cache ($)
° "Local" memory is not (usually) part of the hardware
• Sun, DEC, Intel "SMPs" (Symmetric Multiprocessors) in Millennium; SGI Origin
° Cost: much cheaper to access data in cache than in main memory
° Machine model 1a: A Shared Address Space Machine
• replace caches by local memories (in the abstract machine model)
• this affects the cost model -- repeatedly accessed data should be copied
• Cray T3E

Shared Memory code for computing a sum

Thread 1 [s = 0 initially]         Thread 2 [s = 0 initially]
local_s1 = 0                       local_s2 = 0
for i = 0, n/2-1                   for i = n/2, n-1
  local_s1 = local_s1 + f(A[i])      local_s2 = local_s2 + f(A[i])
s = s + local_s1                   s = s + local_s2

What could go wrong?

Pitfall and solution via synchronization
° Pitfall in computing a global sum s = local_s1 + local_s2 (time flows downward):

Thread 1 (initially s=0)                 Thread 2 (initially s=0)
load s   [from mem to reg]               load s   [from mem to reg; initially 0]
s = s + local_s1  [= local_s1, in reg]   s = s + local_s2  [= local_s2, in reg]
store s  [from reg to mem]               store s  [from reg to mem]

° Instructions from different threads can be interleaved arbitrarily
° What can the final result s stored in memory be?
° Race Condition
° Possible solution: Mutual Exclusion with Locks

Thread 1            Thread 2
lock                lock
load s              load s
s = s + local_s1    s = s + local_s2
store s             store s
unlock              unlock

° Locks must be atomic (execute completely without interruption)

Programming Model 2
° Message Passing
• program consists of a collection of named processes
- thread of control plus local address space
- local variables, static variables, common blocks, heap
• processes communicate by explicit data transfers
- matching pair of send & receive by source and dest. proc.
• coordination is implicit in every communication event
• logically shared data is partitioned over the local processes
° Like distributed programming
[Figure: processes with private address spaces, each holding its own array A; one executes send P0,X while another executes recv Pn,Y]
° Program with standard libraries: MPI, PVM

Machine Model 2
° A distributed memory machine
• Cray T3E (too!), IBM SP2, NOW, Millennium
° Processors all connected to their own memory (and caches)
• cannot directly access another processor's memory
° Each "node" has a network interface (NI)
• all communication and synchronization done through this interconnect
[Figure: nodes P1 ... Pn, each with its own memory and NI, connected by a network]

Computing s = x(1)+x(2) on each processor
° First possible solution:

Processor 1                          Processor 2
send xlocal, proc2  [xlocal = x(1)]  receive xremote, proc1
receive xremote, proc2               send xlocal, proc1  [xlocal = x(2)]
s = xlocal + xremote                 s = xlocal + xremote

° Second possible solution -- what could go wrong?

Processor 1                          Processor 2
send xlocal, proc2  [xlocal = x(1)]  send xlocal, proc1  [xlocal = x(2)]
receive xremote, proc2               receive xremote, proc1
s = xlocal + xremote                 s = xlocal + xremote

° What if send/receive act like the telephone system? The post office?

Programming Model 3
° Data Parallel
• Single sequential thread of control consisting of parallel operations
• Parallel operations applied to all (or a defined subset) of a data structure
• Communication is
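The message-passing example above (Model 2, computing s = x(1)+x(2)) can be sketched in plain Python, modelling each processor as a thread with only private variables and each network channel as a Queue. The channel names and the results dictionary are illustrative stand-ins for real message-passing primitives such as MPI send/receive:

```python
from queue import Queue
from threading import Thread

# One queue per direction models the network channel between the two processors.
chan_1to2 = Queue()
chan_2to1 = Queue()
results = {}  # stand-in for observing each processor's private s

def processor1():
    xlocal = 1.0               # x(1)
    chan_1to2.put(xlocal)      # send xlocal to processor 2
    xremote = chan_2to1.get()  # receive x(2)
    results[1] = xlocal + xremote

def processor2():
    xlocal = 2.0               # x(2)
    xremote = chan_1to2.get()  # receive x(1) first...
    chan_2to1.put(xlocal)      # ...then send, matching processor 1's send-then-receive order
    results[2] = xlocal + xremote

t1, t2 = Thread(target=processor1), Thread(target=processor2)
t1.start(); t2.start()
t1.join(); t2.join()
print(results[1], results[2])  # 3.0 3.0 -- the same s on each "processor"
```

This follows the first (safe) ordering from the slides. Queue.put is buffered, post-office style, so even the send-first/send-first second solution would complete here; with synchronous, telephone-style sends, that second solution would deadlock.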