Berkeley COMPSCI C267 - Lecture 4: More about Shared Memory Processors and Programming

Slide titles in this lecture: CS 267 Applications of Parallel Computers, Lecture 4: More about Shared Memory Processors and Programming; Recap of Last Lecture; Outline; Cost Modeling and Performance Tradeoffs; Example; Identifying enough Concurrency; Algorithmic Trade-offs; Problem Size is Critical; Algorithmic Trade-offs (continued); Shared Memory Architectures; Recap Basic Shared Memory Architecture; Limits of using Bus as Network; Cache Coherence: The Semantic Problem; Programming SMPs; Where are things going; Slide 16; Programming with Threads; Creating Parallelism; Language Notions of Thread Creation; Forking Threads in Solaris; Example Using Solaris Threads; Synchronization; Basic Types of Synchronization: Barrier; Basic Types of Synchronization: Mutexes; Basic Types of Synchronization: Post/Wait; Synchronization at Different Levels; Solaris Threads Example; Producer-Consumer Synchronization; Rolling Your Own Synchronization; But Machines aren't Always Sequentially Consistent; Programming Solutions; Foundation Behind Sequential Consistency; Building Shared Data Structures; Shared Address Allocation; Building Parallel Data Structures; Data Structures; Summary; Uniform Shared Address Space; Segmented Shared Address Space; Work vs. Data Assignment.

Slide 1: CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99

Slide 2: Recap of Last Lecture
° There are several standard programming models (plus variations) that were developed to support particular kinds of architectures
  • shared memory
  • message passing
  • data parallel
° The programming models are no longer strictly tied to particular architectures, and so offer portability of correctness
° Portability of performance still depends on tuning for each architecture
° In each model, parallel programming has 4 phases
  • decomposition into parallel tasks
  • assignment of tasks to threads
  • orchestration of communication and synchronization among threads
  • mapping threads to processors

Slide 3: Outline
° Performance modeling and tradeoffs
° Shared memory architectures
° Shared memory programming

Slide 4: Cost Modeling and Performance Tradeoffs

Slide 5: Example
° s = f(A[1]) + … + f(A[n])
° Decomposition
  • computing each f(A[j])
  • n-fold parallelism, where n may be >> p
  • computing sum s
° Assignment
  • thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1])
  • thread 1 sums s = s1 + … + sp
    - for simplicity of this example; will be improved
  • thread 1 communicates s to other threads
° Orchestration
  • starting up threads
  • communicating, synchronizing with thread 1
° Mapping
  • processor j runs thread j
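The decomposition, assignment, and orchestration steps on the Example slide translate almost line for line into a small threads program. The sketch below is an illustration written for this summary, not code from the lecture: it uses POSIX threads rather than the Solaris thread calls shown later in the deck, and f(), N, and NTHREADS are placeholder names. As on the slide, each thread k sums f(A[j]) over its block of n/p elements, and one thread then forms s = s1 + … + sp serially (the step the later tree-sum slide improves).

/* Minimal sketch of the Example slide's decomposition/assignment using
 * POSIX threads.  f(), N, and NTHREADS are illustrative placeholders. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4                    /* p in the slides */

static double A[N];
static double partial[NTHREADS];      /* s_k computed by thread k */

static double f(double x) { return x * x; }   /* any per-element function */

/* Each thread k sums f(A[j]) over its contiguous block of n/p elements. */
static void *worker(void *arg) {
    long k  = (long)arg;
    long lo = k * (long)N / NTHREADS;
    long hi = (k + 1) * (long)N / NTHREADS;
    double s_k = 0.0;
    for (long j = lo; j < hi; j++)
        s_k += f(A[j]);
    partial[k] = s_k;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (long j = 0; j < N; j++) A[j] = (double)j / N;   /* sample data */

    /* Orchestration: start the threads ... */
    for (long k = 0; k < NTHREADS; k++)
        pthread_create(&tid[k], NULL, worker, (void *)k);

    /* ... wait for them, then a single thread forms s = s_1 + ... + s_p
     * serially, exactly the step the tree-sum slide later parallelizes. */
    double s = 0.0;
    for (long k = 0; k < NTHREADS; k++) {
        pthread_join(tid[k], NULL);
        s += partial[k];
    }
    printf("s = %g\n", s);
    return 0;
}

Joining all workers before the final sum plays the role of the barrier that the slide's orchestration step calls for.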
Slide 6: Identifying enough Concurrency
° Amdahl's law bounds speedup
  • let s = the fraction of total work done sequentially
  • Speedup(P) <= 1 / (s + (1-s)/P)
° Simple decomposition: each f(A[i]) is a parallel task, the sum is sequential
[Concurrency profile: n tasks of time(f) in parallel, then 1 task of time(sum(n)); after mapping onto p processors, the parallel phase is p tasks of (n/p) x time(f).]
° Parallelism profile
  • area is total work done
[Plot: speedup vs. number of processors (up to 100) for sequential fractions S = 0%, 1%, 5%, 10%.]

Slide 7: Algorithmic Trade-offs
° Parallelize the partial sums of the f's
  • what fraction of the computation is "sequential"?
  • what does this do for communication? locality?
  • what if you sum what you "own"?
[Concurrency profile: p tasks of (n/p) x time(f), then p tasks of time(sum(n/p)), then 1 task of time(sum(p)).]

Slide 8: Problem Size is Critical
° Total work = n + P
° Serial work: P
° Parallel work: n
° s = serial fraction = P / (n + P)
° Speedup(P) = n / (n/P + P)
° Speedup decreases for large P if n is small
° In general, seek to exploit a fraction of the peak parallelism in the problem.
[Plot: Amdahl's Law bounds on speedup vs. number of processors (up to 100) for n = 1000, 10000, 1000000.]

Slide 9: Algorithmic Trade-offs
° Parallelize the final summation (tree sum)
° Generalize Amdahl's law for an arbitrary "ideal" parallelism profile
[Concurrency profile: p tasks of (n/p) x time(f), then p tasks of time(sum(n/p)), then log2(p) steps of time(sum(2)).]

Slide 10: Shared Memory Architectures

Slide 11: Recap Basic Shared Memory Architecture
[Diagram: processors P1, P2, ..., Pn, each with its own cache ($), connected through a network to shared memory.]
° Processors all connected to a large shared memory
° Local caches for each processor
° Cost: much cheaper to access cache than main memory
° Simplest to program, but hard to build with many processors
° Now take a closer look at structure, costs, limits

Slide 12: Limits of using Bus as Network
[Diagram: several processors, each with a cache, plus memory and I/O modules, all attached to one shared bus; per-processor bandwidth demand is 260 MB/s without caches and 16 MB/s with them.]
° Assume 100 MB/s bus
° 50 MIPS processor w/o cache
  => 200 MB/s inst BW per processor
  => 60 MB/s data BW at 30% load-store
° Suppose 98% inst hit rate and 95% data hit rate (16 byte block)
  => 4 MB/s inst BW per processor
  => 12 MB/s data BW per processor
  => 16 MB/s combined BW
  => 8 processors will saturate the bus
° Cache provides a bandwidth filter – as well as reducing average access time

Slide 13: Cache Coherence: The Semantic Problem
° p1 and p2 both have cached copies of x (as 0)
° p1 writes x=1 and then the flag, f=1, as a signal to other processors that it has updated x
  • writing f pulls it into p1's cache
  • both of these writes "write through" to memory
° p2 reads f (bringing it into p2's cache) to see if it is 1, which it is
° p2 therefore reads x, expecting the value written by p1, but gets the "stale" cached copy
[Diagram: p1's cache holds x = 1, f = 1; p2's cache holds the stale x = 0 and f = 1; memory holds x = 1, f = 1 after the write-throughs.]
° SMPs have complicated caches to enforce coherence
(A code sketch of this x/flag scenario appears after Slide 14 below.)

Slide 14: Programming SMPs
° Coherent view of shared memory
° All addresses equidistant
  • don't worry about data partitioning
° Caches automatically replicate shared data close to the processor
° If the program concentrates on a block of the data set that no one else updates => very fast
° Communication occurs only on cache misses
  • cache misses are slow
° Processor cannot distinguish communication misses from regular cache misses
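The x/flag scenario on the Cache Coherence slide is worth seeing as code. The sketch below is an illustration written for this summary (POSIX threads, not the lecture's Solaris examples): thread p1 writes x and then raises a flag, while thread p2 spins on the flag and then reads x. On a cache-coherent SMP that keeps these accesses in order, p2 prints 1; without coherence, or if hardware or compiler reordering intervenes, p2 can see flag = 1 while still reading the stale x = 0, which is exactly the semantic problem the slide describes. In portable C this flag pattern is also a data race, which is why the lecture's later synchronization slides recommend real primitives (barriers, mutexes, post/wait) over rolling your own.

/* Sketch of the cache-coherence example: p1 writes x, then flag; p2 spins
 * on flag, then reads x.  Illustration only, not code from the lecture.
 * "volatile" keeps the compiler from caching flag in a register, but it
 * does NOT add memory ordering or coherence guarantees. */
#include <pthread.h>
#include <stdio.h>

static volatile int x    = 0;   /* both caches start with x = 0 */
static volatile int flag = 0;   /* signal that x has been updated */

static void *p1_writer(void *arg) {
    (void)arg;
    x = 1;           /* update the data ...                      */
    flag = 1;        /* ... then signal the other processor      */
    return NULL;
}

static void *p2_reader(void *arg) {
    (void)arg;
    while (flag == 0)
        ;            /* spin until p1 signals */
    /* With coherent caches and in-order visibility this prints x = 1.
     * A stale cached copy (x = 0) is the failure the slide illustrates. */
    printf("p2 read x = %d\n", x);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2_reader, NULL);
    pthread_create(&t1, NULL, p1_writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}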

