CS 267 Applications of Parallel Computers
Lecture 3: Introduction to Parallel Architectures and Programming Models
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
(CS267 L3 Programming Models, Demmel, Sp 1999)

Recap of Last Lecture
° The actual performance of a simple program can be a complicated function of the architecture
° Slight changes in the architecture or program may change the performance significantly
° Since we want to write fast programs, we must take the architecture into account, even on uniprocessors
° Since the actual performance is so complicated, we need simple models to help us design efficient algorithms
° We illustrated with a common technique for improving cache performance, called blocking, applied to matrix multiplication
• Blocking works for many architectures, but choosing the blocksize depends on the architecture

Outline
° Parallel machines and programming models
° Steps in writing a parallel program
° Cost modeling and performance trade-offs

Parallel Machines and Programming Models

A generic parallel architecture
[Figure: processors P connected to memories M through an interconnection network]
° Where does the memory go?

Parallel Programming Models
° Control
• how is parallelism created?
• what orderings exist between operations?
• how do different threads of control synchronize?
° Naming
• what data is private vs. shared?
• how is logically shared data accessed or communicated?
° Set of operations
• what are the basic operations?
• what operations are considered to be atomic?
° Cost
• how do we account for the cost of each of the above?

Trivial Example: computing the sum  f(A[0]) + f(A[1]) + ... + f(A[n-1])
° Parallel Decomposition:
• each evaluation and each partial sum is a task
° Assign n/p numbers to each of p procs
• each computes independent "private" results and a partial sum
• one (or all) collects the p partial sums and computes the global sum
=> Classes of Data
° Logically Shared
• the original n numbers, the global sum
° Logically Private
• the individual function evaluations
• what about the individual partial sums?

Programming Model 1
° Shared Address Space
• program consists of a collection of threads of control,
• each with a set of private variables
- e.g., local variables on the stack
• collectively with a set of shared variables
- e.g., static variables, shared common blocks, global heap
• threads communicate implicitly by writing and reading shared variables
• threads coordinate explicitly by synchronization operations on shared variables
- writing and reading flags
- locks, semaphores
° Like concurrent programming on a uniprocessor
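As a concrete illustration of Model 1, here is a minimal Python-threads sketch of the trivial sum example. The function f(x) = x*x and the half/half split of the array are illustrative choices, not from the slides; the lock makes each thread's read-modify-write of the shared s atomic:

```python
import threading

A = list(range(8))         # the logically shared input array
s = 0                      # logically shared global sum
s_lock = threading.Lock()  # protects updates to s

def f(x):
    return x * x           # illustrative function to sum (an assumption)

def worker(lo, hi):
    global s
    local_s = 0            # logically private partial sum
    for i in range(lo, hi):
        local_s += f(A[i])
    with s_lock:           # mutual exclusion: load/add/store of s run without interleaving
        s += local_s

n = len(A)
t1 = threading.Thread(target=worker, args=(0, n // 2))
t2 = threading.Thread(target=worker, args=(n // 2, n))
t1.start(); t2.start()
t1.join(); t2.join()
print(s)  # 0+1+4+9+16+25+36+49 = 140
```

Without s_lock, the two read-modify-write updates of s could interleave and one partial sum could be lost.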
[Figure: threads with private stacks (x = ..., y = ..x ...) reading and writing a shared region containing A]

Machine Model 1
° A shared memory machine
° Processors all connected to a large shared memory, each through its own cache ($)
° "Local" memory is not (usually) part of the hardware
• Sun, DEC, Intel "SMPs" (Symmetric Multiprocessors) in Millennium; SGI Origin
° Cost: much cheaper to access data in cache than in main memory
° Machine model 1a: A Shared Address Space Machine
• replace caches by local memories (in the abstract machine model)
• this affects the cost model -- repeatedly accessed data should be copied
• Cray T3E

Shared Memory code for computing a sum

Thread 1 [s = 0 initially]         Thread 2 [s = 0 initially]
local_s1 = 0                       local_s2 = 0
for i = 0, n/2-1                   for i = n/2, n-1
  local_s1 = local_s1 + f(A[i])      local_s2 = local_s2 + f(A[i])
s = s + local_s1                   s = s + local_s2

What could go wrong?

Pitfall and solution via synchronization
° Pitfall in computing a global sum s = local_s1 + local_s2 (time flows downward):

Thread 1 (initially s=0)                 Thread 2 (initially s=0)
load s   [from mem to reg]               load s   [from mem to reg; initially 0]
s = s + local_s1  [= local_s1, in reg]   s = s + local_s2  [= local_s2, in reg]
store s  [from reg to mem]               store s  [from reg to mem]

° Instructions from different threads can be interleaved arbitrarily
° What can the final result s stored in memory be?
° Race Condition
° Possible solution: Mutual Exclusion with Locks

Thread 1            Thread 2
lock                lock
load s              load s
s = s + local_s1    s = s + local_s2
store s             store s
unlock              unlock

° Locks must be atomic (execute completely without interruption)

Programming Model 2
° Message Passing
• program consists of a collection of named processes
- thread of control plus local address space
- local variables, static variables, common blocks, heap
• processes communicate by explicit data transfers
- matching pair of send & receive by source and dest. proc.
• coordination is implicit in every communication event
• logically shared data is partitioned over the local processes
° Like distributed programming
[Figure: processes with private address spaces, each holding its own array A; one executes send P0,X while another executes recv Pn,Y]
° Program with standard libraries: MPI, PVM

Machine Model 2
° A distributed memory machine
• Cray T3E (too!), IBM SP2, NOW, Millennium
° Processors all connected to their own memory (and caches)
• cannot directly access another processor's memory
° Each "node" has a network interface (NI)
• all communication and synchronization done through this interconnect
[Figure: nodes P1 ... Pn, each with its own memory and NI, connected by a network]

Computing s = x(1)+x(2) on each processor
° First possible solution:

Processor 1                          Processor 2
send xlocal, proc2  [xlocal = x(1)]  receive xremote, proc1
receive xremote, proc2               send xlocal, proc1  [xlocal = x(2)]
s = xlocal + xremote                 s = xlocal + xremote

° Second possible solution -- what could go wrong?

Processor 1                          Processor 2
send xlocal, proc2  [xlocal = x(1)]  send xlocal, proc1  [xlocal = x(2)]
receive xremote, proc2               receive xremote, proc1
s = xlocal + xremote                 s = xlocal + xremote

° What if send/receive act like the telephone system? The post office?

Programming Model 3
° Data Parallel
• Single sequential thread of control consisting of parallel operations
• Parallel operations applied to all (or a defined subset) of a data structure
• Communication is
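The message-passing example above (Model 2, computing s = x(1)+x(2)) can be sketched in plain Python, modelling each processor as a thread with only private variables and each network channel as a Queue. The channel names and the results dictionary are illustrative stand-ins for real message-passing primitives such as MPI send/receive:

```python
from queue import Queue
from threading import Thread

# One queue per direction models the network channel between the two processors.
chan_1to2 = Queue()
chan_2to1 = Queue()
results = {}  # stand-in for observing each processor's private s

def processor1():
    xlocal = 1.0               # x(1)
    chan_1to2.put(xlocal)      # send xlocal to processor 2
    xremote = chan_2to1.get()  # receive x(2)
    results[1] = xlocal + xremote

def processor2():
    xlocal = 2.0               # x(2)
    xremote = chan_1to2.get()  # receive x(1) first...
    chan_2to1.put(xlocal)      # ...then send, matching processor 1's send-then-receive order
    results[2] = xlocal + xremote

t1, t2 = Thread(target=processor1), Thread(target=processor2)
t1.start(); t2.start()
t1.join(); t2.join()
print(results[1], results[2])  # 3.0 3.0 -- the same s on each "processor"
```

This follows the first (safe) ordering from the slides. Queue.put is buffered, post-office style, so even the send-first/send-first second solution would complete here; with synchronous, telephone-style sends, that second solution would deadlock.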