CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
(CS267 L4 Shared Memory, Demmel Sp 1999)

Recap of Last Lecture

° There are several standard programming models (plus variations) that were developed to support particular kinds of architectures:
  • shared memory
  • message passing
  • data parallel
° The programming models are no longer strictly tied to particular architectures, and so offer portability of correctness
° Portability of performance still depends on tuning for each architecture
° In each model, parallel programming has 4 phases:
  • decomposition into parallel tasks
  • assignment of tasks to threads
  • orchestration of communication and synchronization among threads
  • mapping of threads to processors

Outline

° Performance modeling and tradeoffs
° Shared memory architectures
° Shared memory programming

Cost Modeling and Performance Tradeoffs

Example

° s = f(A[1]) + … + f(A[n])
° Decomposition
  • computing each f(A[j])
  • n-fold parallelism, where n may be >> p
  • computing the sum s
° Assignment
  • thread k sums sk =
f(A[k*n/p]) + … + f(A[(k+1)*n/p - 1])
  • thread 1 sums s = s1 + … + sp
    - done this way for simplicity of the example; will be improved
  • thread 1 communicates s to the other threads
° Orchestration
  • starting up the threads
  • communicating and synchronizing with thread 1
° Mapping
  • processor j runs thread j

Identifying enough Concurrency

° Amdahl’s law bounds speedup:
  • let s = the fraction of total work done sequentially
  • Speedup(P) ≤ 1 / (s + (1 - s)/P)
° Simple decomposition: each f(A[i]) is a parallel task; the sum is sequential
° Parallelism profile (area is the total work done):
  • n × time(f) of concurrency for the f’s; p × (n/p) × time(f) after mapping
  • 1 × time(sum(n)) for the sequential sum
[Figure: speedup vs. number of processors under Amdahl’s law, for s = 0%, 1%, 5%, 10%]

Algorithmic Trade-offs

° Parallelize the partial sums of the f’s
  • what fraction of the computation is “sequential” now?
  • what does this do for communication?
  • locality?
  • what if you sum only what you “own”?
° Concurrency profile:
  • p × (n/p) × time(f) for the parallel f’s
  • p × time(sum(n/p)) for the parallel partial sums
  • 1 × time(sum(p)) for the final serial sum

Problem Size is Critical

° Total work = n + P
° Serial work: P
° Parallel work: n
° s = serial fraction = P / (n + P)
° Speedup(P) = n / (n/P + P)
° Speedup decreases for large P if n is small
[Figure: Amdahl’s Law bounds: speedup vs. number of processors, for n = 1000, 10000, 1000000]
° In general, seek to exploit only a fraction of the peak parallelism in the problem

Algorithmic Trade-offs

° Parallelize the final summation as well (tree sum)
° Generalize Amdahl’s law for an arbitrary “ideal” parallelism profile
° Concurrency profile:
  • p × (n/p) × time(f)
  • p × time(sum(n/p))
  • log_2 p × time(sum(2)) for the tree sum

Shared Memory Architectures

Recap: Basic Shared Memory Architecture

[Diagram: processors P1, P2, …, Pn, each with its own cache ($), connected through a network to a shared memory]
° Processors are all connected to a large shared memory
° Local caches for each processor
° Cost: a cache access is much cheaper than a main-memory access
° Simplest to program, but hard to build with many processors
° Now take a closer look at structure, costs, limits

Limits of using Bus as Network

[Diagram: processors with caches, memory modules, and I/O all attached to a single shared bus]
° Assume a 100 MB/s bus and a 50 MIPS processor without a cache:
  • => 200 MB/s instruction bandwidth per processor
  • => 60 MB/s data bandwidth at 30% load-stores
  • => 260 MB/s combined bandwidth per processor
° Suppose a 98% instruction hit rate and a 95% data hit rate (16-byte blocks):
  • => 4 MB/s instruction bandwidth per processor
  • => 12 MB/s data bandwidth per processor
  • => 16 MB/s combined bandwidth, so 8 processors will saturate the bus
° The cache acts as a bandwidth filter, as well as reducing the average access time

Cache Coherence: The Semantic Problem

° p1 and p2 both have cached copies of x (as 0)
° p1 writes x = 1 and then the flag f = 1, as a signal to other processors that it has updated x
  • writing f pulls it into p1’s cache
  • both of these writes
“write through” to memory
° p2 reads f (bringing it into p2’s cache) to see if it is 1, which it is
° p2 therefore reads x, expecting the value written by p1, but gets the “stale” cached copy
[Diagram: memory holds x = 1, f = 1; p1’s cache holds x = 1, f = 1; p2’s cache holds f = 1 but the stale x = 0]
° SMPs have complicated caches to enforce coherence

Programming SMPs

° Coherent view of shared memory
° All addresses are equidistant
  • don’t worry about data partitioning
° Caches automatically replicate shared data close to the processor
° If the program concentrates on a block of the data set that no one else updates => very fast
° Communication occurs only on cache misses
  • cache misses are slow
° The processor cannot distinguish communication misses from regular cache misses
° Cache blocks may introduce unnecessary communication:
  • two distinct variables in the same cache block
  • false sharing

Where are things going

° High-end
  • collections of almost-complete workstations/SMPs on a high-speed network (Millennium)
  • with a specialized communication assist integrated with the memory system to provide global access to shared data
° Mid-end
  • almost all servers are bus-based cache-coherent (CC) SMPs
  • high-end servers are replacing the