Berkeley COMPSCI C267 - Lecture 4: More about Shared Memory Processors and Programming

Slide titles in this lecture: CS 267 Applications of Parallel Computers, Lecture 4: More about Shared Memory Processors and Programming; Recap of Last Lecture; Outline; Cost Modeling and Performance Tradeoffs; Example; Identifying enough Concurrency; Algorithmic Trade-offs; Problem Size is Critical; Algorithmic Trade-offs (continued); Shared Memory Architectures; Recap Basic Shared Memory Architecture; Limits of using Bus as Network; Cache Coherence: The Semantic Problem; Programming SMPs; Where are things going; Slide 16; Programming with Threads; Creating Parallelism; Language Notions of Thread Creation; Forking Threads in Solaris; Example Using Solaris Threads; Synchronization; Basic Types of Synchronization: Barrier; Basic Types of Synchronization: Mutexes; Basic Types of Synchronization: Post/Wait; Synchronization at Different Levels; Solaris Threads Example; Producer-Consumer Synchronization; Rolling Your Own Synchronization; But Machines aren't Always Sequentially Consistent; Programming Solutions; Foundation Behind Sequential Consistency; Building Shared Data Structures; Shared Address Allocation; Building Parallel Data Structures; Data Structures; Summary; Uniform Shared Address Space; Segmented Shared Address Space; Work vs. Data Assignment.

Slide 1: CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99

Slide 2: Recap of Last Lecture
° There are several standard programming models (plus variations) that were developed to support particular kinds of architectures
  • shared memory
  • message passing
  • data parallel
° The programming models are no longer strictly tied to particular architectures, and so offer portability of correctness
° Portability of performance still depends on tuning for each architecture
° In each model, parallel programming has 4 phases
  • decomposition into parallel tasks
  • assignment of tasks to threads
  • orchestration of communication and synchronization among threads
  • mapping threads to processors

Slide 3: Outline
° Performance modeling and tradeoffs
° Shared memory architectures
° Shared memory programming

Slide 4: Cost Modeling and Performance Tradeoffs

Slide 5: Example
° s = f(A[1]) + … + f(A[n])
° Decomposition
  • computing each f(A[j])
  • n-fold parallelism, where n may be >> p
  • computing sum s
° Assignment
  • thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1])
  • thread 1 sums s = s1 + … + sp
    - for simplicity of this example; will be improved
  • thread 1 communicates s to other threads
° Orchestration
  • starting up threads
  • communicating, synchronizing with thread 1
° Mapping
  • processor j runs thread j
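The decomposition, assignment, and orchestration steps on the Example slide translate almost line for line into a small threads program. The sketch below is an illustration written for this summary, not code from the lecture: it uses POSIX threads rather than the Solaris thread calls shown later in the deck, and f(), N, and NTHREADS are placeholder names. As on the slide, each thread k sums f(A[j]) over its block of n/p elements, and one thread then forms s = s1 + … + sp serially (the step the later tree-sum slide improves).

/* Minimal sketch of the Example slide's decomposition/assignment using
 * POSIX threads.  f(), N, and NTHREADS are illustrative placeholders. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4                    /* p in the slides */

static double A[N];
static double partial[NTHREADS];      /* s_k computed by thread k */

static double f(double x) { return x * x; }   /* any per-element function */

/* Each thread k sums f(A[j]) over its contiguous block of n/p elements. */
static void *worker(void *arg) {
    long k  = (long)arg;
    long lo = k * (long)N / NTHREADS;
    long hi = (k + 1) * (long)N / NTHREADS;
    double s_k = 0.0;
    for (long j = lo; j < hi; j++)
        s_k += f(A[j]);
    partial[k] = s_k;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (long j = 0; j < N; j++) A[j] = (double)j / N;   /* sample data */

    /* Orchestration: start the threads ... */
    for (long k = 0; k < NTHREADS; k++)
        pthread_create(&tid[k], NULL, worker, (void *)k);

    /* ... wait for them, then a single thread forms s = s_1 + ... + s_p
     * serially, exactly the step the tree-sum slide later parallelizes. */
    double s = 0.0;
    for (long k = 0; k < NTHREADS; k++) {
        pthread_join(tid[k], NULL);
        s += partial[k];
    }
    printf("s = %g\n", s);
    return 0;
}

Joining all workers before the final sum plays the role of the barrier that the slide's orchestration step calls for.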
Slide 6: Identifying enough Concurrency
° Amdahl's law bounds speedup
  • let s = the fraction of total work done sequentially
  • Speedup(P) <= 1 / (s + (1-s)/P)
° Simple decomposition: each f(A[i]) is a parallel task, the sum is sequential
[Concurrency profile: n tasks of time(f) in parallel, then 1 task of time(sum(n)); after mapping onto p processors, the parallel phase is p tasks of (n/p) x time(f).]
° Parallelism profile
  • area is total work done
[Plot: speedup vs. number of processors (up to 100) for sequential fractions S = 0%, 1%, 5%, 10%.]

Slide 7: Algorithmic Trade-offs
° Parallelize the partial sums of the f's
  • what fraction of the computation is "sequential"?
  • what does this do for communication? locality?
  • what if you sum what you "own"?
[Concurrency profile: p tasks of (n/p) x time(f), then p tasks of time(sum(n/p)), then 1 task of time(sum(p)).]

Slide 8: Problem Size is Critical
° Total work = n + P
° Serial work: P
° Parallel work: n
° s = serial fraction = P / (n + P)
° Speedup(P) = n / (n/P + P)
° Speedup decreases for large P if n is small
° In general, seek to exploit a fraction of the peak parallelism in the problem.
[Plot: Amdahl's Law bounds on speedup vs. number of processors (up to 100) for n = 1000, 10000, 1000000.]

Slide 9: Algorithmic Trade-offs
° Parallelize the final summation (tree sum)
° Generalize Amdahl's law for an arbitrary "ideal" parallelism profile
[Concurrency profile: p tasks of (n/p) x time(f), then p tasks of time(sum(n/p)), then log2(p) steps of time(sum(2)).]

Slide 10: Shared Memory Architectures

Slide 11: Recap Basic Shared Memory Architecture
[Diagram: processors P1, P2, ..., Pn, each with its own cache ($), connected through a network to shared memory.]
° Processors all connected to a large shared memory
° Local caches for each processor
° Cost: much cheaper to access cache than main memory
° Simplest to program, but hard to build with many processors
° Now take a closer look at structure, costs, limits

Slide 12: Limits of using Bus as Network
[Diagram: several processors, each with a cache, plus memory and I/O modules, all attached to one shared bus; per-processor bandwidth demand is 260 MB/s without caches and 16 MB/s with them.]
° Assume 100 MB/s bus
° 50 MIPS processor w/o cache
  => 200 MB/s inst BW per processor
  => 60 MB/s data BW at 30% load-store
° Suppose 98% inst hit rate and 95% data hit rate (16 byte block)
  => 4 MB/s inst BW per processor
  => 12 MB/s data BW per processor
  => 16 MB/s combined BW
  => 8 processors will saturate the bus
° Cache provides a bandwidth filter – as well as reducing average access time

Slide 13: Cache Coherence: The Semantic Problem
° p1 and p2 both have cached copies of x (as 0)
° p1 writes x=1 and then the flag, f=1, as a signal to other processors that it has updated x
  • writing f pulls it into p1's cache
  • both of these writes "write through" to memory
° p2 reads f (bringing it into p2's cache) to see if it is 1, which it is
° p2 therefore reads x, expecting the value written by p1, but gets the "stale" cached copy
[Diagram: p1's cache holds x = 1, f = 1; p2's cache holds the stale x = 0 and f = 1; memory holds x = 1, f = 1 after the write-throughs.]
° SMPs have complicated caches to enforce coherence
(A code sketch of this x/flag scenario appears after Slide 14 below.)

Slide 14: Programming SMPs
° Coherent view of shared memory
° All addresses equidistant
  • don't worry about data partitioning
° Caches automatically replicate shared data close to the processor
° If the program concentrates on a block of the data set that no one else updates => very fast
° Communication occurs only on cache misses
  • cache misses are slow
° Processor cannot distinguish communication misses from regular cache misses
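The x/flag scenario on the Cache Coherence slide is worth seeing as code. The sketch below is an illustration written for this summary (POSIX threads, not the lecture's Solaris examples): thread p1 writes x and then raises a flag, while thread p2 spins on the flag and then reads x. On a cache-coherent SMP that keeps these accesses in order, p2 prints 1; without coherence, or if hardware or compiler reordering intervenes, p2 can see flag = 1 while still reading the stale x = 0, which is exactly the semantic problem the slide describes. In portable C this flag pattern is also a data race, which is why the lecture's later synchronization slides recommend real primitives (barriers, mutexes, post/wait) over rolling your own.

/* Sketch of the cache-coherence example: p1 writes x, then flag; p2 spins
 * on flag, then reads x.  Illustration only, not code from the lecture.
 * "volatile" keeps the compiler from caching flag in a register, but it
 * does NOT add memory ordering or coherence guarantees. */
#include <pthread.h>
#include <stdio.h>

static volatile int x    = 0;   /* both caches start with x = 0 */
static volatile int flag = 0;   /* signal that x has been updated */

static void *p1_writer(void *arg) {
    (void)arg;
    x = 1;           /* update the data ...                      */
    flag = 1;        /* ... then signal the other processor      */
    return NULL;
}

static void *p2_reader(void *arg) {
    (void)arg;
    while (flag == 0)
        ;            /* spin until p1 signals */
    /* With coherent caches and in-order visibility this prints x = 1.
     * A stale cached copy (x = 0) is the failure the slide illustrates. */
    printf("p2 read x = %d\n", x);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2_reader, NULL);
    pthread_create(&t1, NULL, p1_writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}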

