Berkeley COMPSCI C267 - Introduction to Parallel Machines and Programming Models

Kathy Yelick (yelick@cs.berkeley.edu, www.cs.berkeley.edu/~yelick/cs267_sp07)
January 25, 2007, CS267 Lecture 4

Outline

- Overview of parallel machines (hardware) and programming models (software):
  shared memory, shared address space, message passing, data parallel,
  clusters of SMPs, and grid.
- A parallel machine may or may not be tightly coupled to a programming
  model. Historically the coupling was tight; today portability is important.
- Trends in real machines.

A Generic Parallel Architecture

[Figure: several Proc units and several Memory units joined by an
Interconnection Network.]

The key design questions: Where is the memory physically located? Is it
connected directly to the processors? What is the connectivity of the
network?

Parallel Programming Models

A programming model is made up of the languages and libraries that create an
abstract view of the machine. It answers questions in four areas:

- Control: How is parallelism created? What orderings exist between
  operations? How do different threads of control synchronize?
- Data: What data is private vs. shared? How is logically shared data
  accessed or communicated?
- Synchronization: What operations can be used to coordinate parallelism?
  What are the atomic (indivisible) operations?
- Cost: How do we account for the cost of each of the above?

Simple Example

Consider applying a function f to the elements of an array A and then
computing the sum:

    s = \sum_{i=0}^{n-1} f(A[i])

Questions: Where does A live? All in a single memory, or partitioned? What
work will be done by each processor? And since the processors must
coordinate to produce a single result, how? In sequential pseudocode:

    A  = array of all data
    fA = f(A)
    s  = sum(fA)

Programming Model 1: Shared Memory

- A program is a collection of threads of control; in some languages,
  threads can be created dynamically in mid-execution.
- Each thread has a set of private variables (e.g., local stack variables)
  and a set of shared variables (e.g., static variables, shared common
  blocks, or the global heap).
- Threads communicate implicitly by writing and reading shared variables,
  and they coordinate by synchronizing on shared variables.

[Figure: threads P0, P1, ..., Pn, each with a private memory holding its own
i (i = 2, i = 5, i = 8), all reading and writing variables such as s and y
in one shared memory.]

Simple Example: Shared Memory Strategy

Use a small number p << n = size(A) of processors attached to a single
memory. The parallel decomposition:

- Each evaluation of f and each partial sum is a task.
- Assign n/p numbers to each of the p processors; each computes independent
  "private" results and a partial sum.
- Collect the p partial sums and compute a global sum.

There are two classes of data. Logically shared: the original n numbers and
the global sum. Logically private: the individual function evaluations.
(What about the individual partial sums?)

Shared Memory "Code" for Computing a Sum

    static int s = 0;

    Thread 1:
        for i = 0, n/2 - 1
            s = s + f(A[i])

    Thread 2:
        for i = n/2, n - 1
            s = s + f(A[i])

The problem is a race condition on the variable s. A race condition (or data
race) occurs when two processors or two threads access the same variable and
at least one of the accesses is a write; the accesses are concurrent (not
synchronized), so they could happen simultaneously.

To see the failure, assume A = [3, 5], f is the squaring function, and s = 0
initially. Each thread's update of s compiles into register-level steps:

    Thread 1:
        compute f(A[i]), put it in reg0    (9)
        reg1 = s                           (0)
        reg1 = reg1 + reg0                 (9)
        s = reg1                           (9)

    Thread 2:
        compute f(A[i]), put it in reg0    (25)
        reg1 = s                           (0)
        reg1 = reg1 + reg0                 (25)
        s = reg1                           (25)

For this program to work, s should be 34 at the end, but depending on how
the steps interleave it may be 34, 9, or 25. The only atomic operations are
the individual reads and writes (you never see half of a number written),
but the + operation is not atomic, and all computation happens in private
registers.

Improved Code for Computing a Sum

    static int s = 0;
    static lock lk;

    Thread 1:
        local_s1 = 0
        for i = 0, n/2 - 1
            local_s1 = local_s1 + f(A[i])
        lock(lk)
        s = s + local_s1
        unlock(lk)

    Thread 2:
        local_s2 = 0
        for i = n/2, n - 1
            local_s2 = local_s2 + f(A[i])
        lock(lk)
        s = s + local_s2
        unlock(lk)

Since addition is associative, it is OK to rearrange the order of the
additions. Most of the computation is now on private variables, and the
frequency of sharing is reduced, which might improve speed. The update of
the shared s would still be a race condition, but the locks fix it: only one
thread can hold a lock at a time, and the others wait for it.
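To make the locked partial-sum pattern concrete, here is a minimal runnable
sketch using POSIX threads. Everything beyond the slide's pseudocode (the
worker function, the range struct, the choice of f as squaring, the value of
N) is an illustrative assumption, not from the lecture.

    /* Locked partial-sum sketch: two threads, private accumulators, one
       short critical section per thread. Build with: cc -pthread sum.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static int A[N];
    static long s = 0;                          /* logically shared sum */
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    static long f(int x) { return (long)x * x; }   /* f = square */

    struct range { int lo, hi; };               /* half-open [lo, hi) */

    static void *worker(void *arg)
    {
        struct range *r = arg;
        long local_s = 0;                       /* logically private */
        for (int i = r->lo; i < r->hi; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&lk);                /* serialize only the */
        s += local_s;                           /* final update of s  */
        pthread_mutex_unlock(&lk);
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) A[i] = 1;   /* f(1) = 1, so s should be N */
        pthread_t t1, t2;
        struct range r1 = { 0, N / 2 }, r2 = { N / 2, N };
        pthread_create(&t1, NULL, worker, &r1);
        pthread_create(&t2, NULL, worker, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %ld (expected %d)\n", s, N);
        return 0;
    }

Removing the mutex and adding each f(A[i]) directly to s inside the loop
reproduces the lost-update race described above: many runs then print a
total smaller than N.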
Machine Model 1a: Shared Memory

- All processors are connected to a large shared memory. Such machines are
  typically called Symmetric Multiprocessors (SMPs): SGI, Sun, HP, Intel,
  and IBM SMPs, such as the nodes of Millennium and of the SP. Multicore
  chips fit the same model, except that all caches are shared.
- Difficulty: scaling to large numbers of processors (up to about 32
  processors is typical).
- Advantage: uniform memory access (UMA).
- Cost note: it is much cheaper to access data in cache than in main memory.

[Figure: processors P1, P2, ..., Pn, each with its own cache, attached by a
bus to one shared memory.]

Problems Scaling Shared-Memory Hardware

Why not just attach more processors (with a larger memory)? The memory bus
becomes a bottleneck, and the caches need to be kept coherent. An example
from the Parallel Spectral Transform Shallow Water Model (PSTSWM)
demonstrates the problem (experimental results and slide from Pat Worley at
ORNL):

- PSTSWM is an important kernel in atmospheric models. 99% of its floating
  point operations are multiplies or adds, which generally run well on all
  processors.
- But it sweeps through memory with little reuse of operands, so it uses the
  bus and shared memory frequently.
- The experiments show "serial" performance: one copy of the code runs
  independently on varying numbers of processors. This is the best case for
  shared memory, since nothing is shared, but the data does not all fit in
  the registers and cache.

Example: Problem in Scaling Shared Memory

[Figure, from Pat Worley, ORNL: performance degradation is a smooth function
of the number of processes, even though the processes share no data and so
should show perfect parallelism. The code was run with 18 vertical levels
over a range of horizontal sizes.]

Machine Model 1b: Multithreaded Processor

- Multiple thread "contexts" share one processor rather than each having a
  full processor; memory and some other state are shared.
- Example: the Sun Niagara processor for servers runs up to 32 threads
  simultaneously; in addition to sharing memory, the threads share floating
  point units.
- Why? The processor can switch between threads during long-latency memory
  operations.
- The Cray MTA and Eldorado processors take the same approach for HPC.

[Figure: thread contexts T0, T1, ..., Tn sharing floating point units and
other resources, all attached to one memory.]

Eldorado Processor (logical view)

[Figure: programs running in parallel are broken into concurrent threads of
computation (serial code plus subproblems A and B, with iterations
i = 0, 1, ..., n), which are mapped onto up to 128 hardware streams; streams
with no work remain unused.]
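The PSTSWM observation, that streaming through memory with little operand
reuse stresses the memory system even before any sharing happens, can be
seen on one processor with a micro-benchmark like the hypothetical sketch
below. The sizes, repetition counts, and timing approach are illustrative
assumptions, not from the lecture; both calls perform the same number of
additions, but one array fits in cache and the other does not.

    /* Same total work, different footprint: a pure streaming sum over a
       cache-resident array vs. a memory-resident one. Build: cc -O2 bench.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile double sink;   /* keeps the compiler from deleting the loops */

    /* Sum the array `reps` times and report nanoseconds per element. */
    static double ns_per_elem(size_t n, int reps)
    {
        double *a = malloc(n * sizeof *a);
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double s = 0.0;
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                s += a[i];         /* each element used once per sweep: no reuse */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        sink = s;
        free(a);
        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        return 1e9 * secs / ((double)n * reps);
    }

    int main(void)
    {
        /* 2^13 doubles (64 KB) swept 2^13 times vs. 2^23 doubles (64 MB)
           swept 8 times: identical addition counts, different footprints. */
        printf("cache-resident:  %.3f ns/elem\n", ns_per_elem(1u << 13, 1 << 13));
        printf("memory-resident: %.3f ns/elem\n", ns_per_elem(1u << 23, 8));
        return 0;
    }

On a typical machine the second line is noticeably slower per element even
though the arithmetic is identical. That per-processor pressure on the
memory system is what compounds into a bus bottleneck when many processors
on an SMP run the same kind of sweep.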

