Berkeley COMPSCI C267 - Introduction to Parallel Machines and Programming Models

Contents of the full 47-slide deck: CS 267: Introduction to Parallel Machines and Programming Models; Outline; A generic parallel architecture; Parallel Programming Models; Simple Example; Programming Model 1: Shared Memory; Shared Memory Code for Computing a Sum; Improved Code for Computing a Sum; Machine Model 1a: Shared Memory; Problems Scaling Shared Memory; Machine Model 1b: Distributed Shared Memory; Programming Model 2: Message Passing; Computing s = A[1]+A[2] on each processor; MPI - the de facto standard; Machine Model 2a: Distributed Memory; PC Clusters: Contributions of Beowulf; Open Source Software Model for HPC; Tflop/s Clusters; Internet Computing - SETI@home; Programming Model 2b: Global Addr Space; Machine Model 2b: Global Address Space; Programming Model 3: Data Parallel; Machine Model 3a: SIMD System; Model 3b: Vector Machines; Vector Processors; Cray X1: Parallel Vector Architecture; Earth Simulator Architecture; Machine Model 4: Clusters of SMPs; Cluster of SMP Approach; 22nd List: The TOP10; Continents Performance; Customer Types; Manufacturers; Manufacturers Performance; Processor Types; Architectures; NOW - Clusters; Analysis of TOP500 Data; Summary; Reading Assignment.

CS 267: Introduction to Parallel Machines and Programming Models
Katherine Yelick
http://www.cs.berkeley.edu/~yelick/cs267

Outline
• Overview of parallel machines and programming models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
• Trends in real machines

A generic parallel architecture
[Diagram: processors P ... P connected through an interconnection network to memories M ... M.]
• Where is the memory physically located?

Parallel Programming Models
• Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
• Data
  • What data is private vs. shared?
  • How is logically shared data accessed or communicated?
• Operations
  • What are the atomic (indivisible) operations?
• Cost
  • How do we account for the cost of each of the above?

Simple Example
Consider computing the sum of a function over an array, $\sum_{i=0}^{n-1} f(A[i])$:
• Parallel decomposition:
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of p procs.
  • Each computes independent "private" results and a partial sum.
  • One (or all) collects the p partial sums and computes the global sum.
Two Classes of Data:
• Logically shared: the original n numbers, the global sum.
• Logically private: the individual function evaluations. What about the individual partial sums?

Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in some languages.
• Each thread has a set of private variables, e.g., local stack variables.
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate by synchronizing on shared variables.
[Diagram: threads P0 ... Pn all read and write a variable s in shared memory; each thread also has a private copy of i.]

Shared Memory Code for Computing a Sum

    static int s = 0;

    Thread 1                              Thread 2
    for i = 0, n/2-1                      for i = n/2, n-1
        s = s + f(A[i])                       s = s + f(A[i])

• Problem: there is a race condition on variable s in this program.
• A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write;
  - the accesses are concurrent (not synchronized), so they could happen simultaneously.

Shared Memory Code for Computing a Sum

    static int s = 0;

    Thread 1                              Thread 2
    compute f(A[i]) and put in reg0       compute f(A[i]) and put in reg0
    reg1 = s                              reg1 = s
    reg1 = reg1 + reg0                    reg1 = reg1 + reg0
    s = reg1                              s = reg1

• Assume s = 27, with f(A[i]) = 7 on Thread 1 and 9 on Thread 2.
• For this program to work, s should be 43 at the end, but it may be 43, 34, or 36: 34 or 36 result when both threads read s = 27 before either writes it back, so one update is lost.
• The atomic operations are reads and writes, so you never see half of one number.
• All computations happen in (private) registers.
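To make the race concrete, here is a minimal runnable sketch of the two-thread sum in C with POSIX threads. This is not code from the lecture: the size N, the contents of A, the identity function f(), and the range struct are invented placeholders for illustration.

    /* Racy shared-memory sum: a sketch, not course code.  N, A's
     * contents, f(), and struct range are invented placeholders. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 100000

    static int A[N];
    static int s = 0;                 /* logically shared, as on the slide */

    static int f(int x) { return x; }   /* placeholder for f() */

    struct range { int lo, hi; };

    static void *sum_half(void *arg)
    {
        struct range *r = arg;
        for (int i = r->lo; i < r->hi; i++)
            s = s + f(A[i]);          /* unsynchronized read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        struct range r1 = { 0, N / 2 }, r2 = { N / 2, N };

        for (int i = 0; i < N; i++)
            A[i] = 1;                 /* correct answer is then N */

        pthread_create(&t1, NULL, sum_half, &r1);
        pthread_create(&t2, NULL, sum_half, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Lost updates may make this print less than N. */
        printf("s = %d (expected %d)\n", s, N);
        return 0;
    }

Built with something like cc -pthread race.c, repeated runs can print different totals, which is exactly the nondeterminism the 43/34/36 example above describes.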
Improved Code for Computing a Sum

    static int s = 0;

    Thread 1                              Thread 2
    local_s1 = 0                          local_s2 = 0
    for i = 0, n/2-1                      for i = n/2, n-1
        local_s1 = local_s1 + f(A[i])         local_s2 = local_s2 + f(A[i])
    s = s + local_s1                      s = s + local_s2

• Since addition is associative, it's OK to rearrange the order in which the partial sums are combined.
• Most computation is now on private variables.
  - Sharing frequency is also reduced, which might improve speed.
  - But there is still a race condition on the update of shared s.
  - The race condition can be fixed by adding locks: only one thread can hold a lock at a time, and the others wait for it. The slide's fix wraps each shared update:

    static lock lk;

    lock(lk);                             lock(lk);
    s = s + local_s1                      s = s + local_s2
    unlock(lk);                           unlock(lk);
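The same sketch with this fix applied, under the same invented placeholders (N, A, f): each thread accumulates into a private local sum, and a pthread_mutex_t stands in for the slide's lock lk around the single shared update.

    /* Slide pattern rendered runnably: private partial sums, one
     * mutex-protected update of shared s.  Same placeholders as the
     * racy sketch above; pthread_mutex_t stands in for `lock lk`. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 100000

    static int A[N];
    static int s = 0;
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static int f(int x) { return x; }   /* placeholder for f() */

    struct range { int lo, hi; };

    static void *sum_half(void *arg)
    {
        struct range *r = arg;
        int local_s = 0;              /* private: no sharing inside the loop */

        for (int i = r->lo; i < r->hi; i++)
            local_s = local_s + f(A[i]);

        pthread_mutex_lock(&lk);      /* lock(lk)   */
        s = s + local_s;              /* the only touch of shared s */
        pthread_mutex_unlock(&lk);    /* unlock(lk) */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        struct range r1 = { 0, N / 2 }, r2 = { N / 2, N };

        for (int i = 0; i < N; i++)
            A[i] = 1;

        pthread_create(&t1, NULL, sum_half, &r1);
        pthread_create(&t2, NULL, sum_half, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("s = %d (always %d)\n", s, N);
        return 0;
    }

Note that each thread takes the lock once rather than once per element, matching the slide's point that the private local sum reduces sharing frequency as well as fixing correctness.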
shared?•How is logically shared data accessed or communicated?•Operations•What are the atomic (indivisible) operations?•Cost•How do we account for the cost of each of the above?02/04/2004CS267 Lecure 5 5Simple ExampleConsider a sum of an array function: •Parallel Decomposition: •Each evaluation and each partial sum is a task.•Assign n/p numbers to each of p procs•Each computes independent “private” results and partial sum.•One (or all) collects the p partial sums and computes the global sum.Two Classes of Data: •Logically Shared•The original n numbers, the global sum.•Logically Private•The individual function evaluations.•What about the individual partial sums?10])[(niiAf02/04/2004CS267 Lecure 5 6Programming Model 1: Shared Memory•Program is a collection of threads of control.•Can be created dynamically, mid-execution, in some languages•Each thread has a set of private variables, e.g., local stack variables •Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.•Threads communicate implicitly by writing and reading shared variables.•Threads coordinate by synchronizing on shared variablesPnP1P0s s = ...y = ..s ...Shared memoryi: 2i: 5Private memoryi: 802/04/2004CS267 Lecure 5 7Shared Memory Code for Computing a SumThread 1 for i = 0, n/2-1 s = s + f(A[i])Thread 2 for i = n/2, n-1 s = s + f(A[i])static int s = 0;•Problem is a race condition on variable s in the program•A race condition or data race occurs when:-two processors (or two threads) access the same variable, and at least one does a write.-The accesses are concurrent (not synchronized) so they could happen simultaneously02/04/2004CS267 Lecure 5 8Shared Memory Code for Computing a SumThread 1 …. compute f([A[i]) and put in reg0 reg1 = s reg1 = reg1 + reg0 s = reg1 …Thread 2 … compute f([A[i]) and put in reg0 reg1 = s reg1 = reg1 + reg0 s = reg1 …static int s = 0;•Assume s=27, f(A[i])=7 on Thread1 and =9 on Thread2•For this program to work, s should be 43 at the end•but it may be 43, 34, or 36•The atomic operations are reads and writes•Never see ½ of one number•All computations happen in (private) registers7 9272734 36363402/04/2004CS267 Lecure 5 9Improved Code for Computing a SumThread 1 local_s1= 0 for i = 0, n/2-1 local_s1 = local_s1 + f(A[i]) s = s + local_s1 Thread 2 local_s2 = 0 for i = n/2, n-1 local_s2= local_s2 + f(A[i]) s = s +local_s2 static int s = 0;•Since addition is associative, it’s OK to rearrange order•Most computation is on private variables-Sharing frequency is also reduced, which might improve speed -But there is still a race condition on the update of shared s-The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it)static lock lk;lock(lk);unlock(lk);lock(lk);unlock(lk);02/04/2004CS267 Lecure 5 10Machine Model 1a: Shared MemoryP1network/bus$memory•Processors all connected to a large shared memory.•Typically called Symmetric Multiprocessors (SMPs)•Sun, HP, Intel, IBM SMPs (nodes of Millennium, SP)•“Local” memory is not (usually) part of the hardware abstraction.•Difficulty scaling to large numbers of processors•<32 processors typical•Advantage: uniform memory access (UMA)•Cost: much cheaper to access data in cache than main memory.P2$Pn$02/04/2004CS267 Lecure 5 11Problems Scaling Shared Memory•Why not put more processors on (with larger memory?)•The memory bus becomes a bottleneck•Example from a Parallel Spectral Transform Shallow Water Model (PSTSWM) demonstrates the problem•Experimental results (and slide) from Pat 
Worley at ORNL•This is an important kernel in atmospheric models•99% of the floating point operations are multiplies or adds, which generally run well on all processors•But it does sweeps through memory with little reuse of operands, which exercises the memory system•These experiments show serial performance, with one “copy” of the code running independently on varying numbers of procs•The best case for shared memory: no sharing•But the data doesn’t all fit in the registers/cache02/04/2004CS267 Lecure 5 12From Pat Worley, ORNLExample: Problem in Scaling Shared Memory•Performance degradation is a “smooth” function of the number of processes.•No shared data between them, so there should be perfect parallelism.•(Code was run for a 18 vertical levels with a range of horizontal sizes.)02/04/2004CS267 Lecure 5 13Machine Model 1b: Distributed Shared

