Berkeley COMPSCI C267 - Lecture Notes

This preview shows pages 1-4, 26-28, and 54-57 of 57.

CS 267: Introduction to Parallel Machines and Programming Models
Lecture 3
James Demmel and Horst Simon
http://www.cs.berkeley.edu/~demmel/cs267_Spr10/
01/26/2010

Slide contents: Announcements; Outline; A generic parallel architecture; Parallel Programming Models; Simple Example; Programming Model 1: Shared Memory; Shared Memory "Code" for Computing a Sum; Improved Code for Computing a Sum; Machine Model 1a: Shared Memory; Problems Scaling Shared Memory Hardware; Machine Model 1b: Multithreaded Processor; Machine Model 1c: Distributed Shared Memory; Programming Model 2: Message Passing; Computing s = A[1]+A[2] on each processor; MPI – the de facto standard; Machine Model 2a: Distributed Memory; PC Clusters: Contributions of Beowulf; Tflop/s and Pflop/s Clusters; Machine Model 2b: Internet/Grid Computing; Programming Model 2a: Global Address Space; Machine Model 2c: Global Address Space; Programming Model 3: Data Parallel; Machine Model 3a: SIMD System; Machine Model 3b: Vector Machines; Vector Processors; Cray X1: Parallel Vector Architecture; Earth Simulator Architecture; Programming Model 4: Hybrids; Machine Model 4: Clusters of SMPs; extra slides (TOP500 from Nov 2009): 34th List: The TOP10; Jaguar @ ORNL: 1.75 PF/s; Roadrunner @ LANL: 1.04 PF/s; 34th List: Notable (New) Systems; Performance Development; Projected Performance Development; Replacement Rate; Vendors / System Share; Processor Architecture / Systems; Operating Systems; Architectures; Architectures (TOP50); Processors / Systems; Processors / Performance; Cores per Socket; Cluster Interconnects; Interconnect Family; Absolute Power Levels.

Announcements
• Seminar: "Models, Algorithms, and Software: Tradeoffs in the Design of High-Performance Computational Simulations in Science and Engineering", Phil Colella, LBNL, 4-5pm on Thursday, Jan 28, in 306 Soda Hall.
• Cray XT-5 workshop: Feb. 1-3, 2010, UC Berkeley, 250 Sutardja Dai Hall. http://www.nersc.gov/projects/workshops/CrayXT/

Outline
• Overview of parallel machines (~hardware) and programming models (~software):
  - Shared memory
  - Shared address space
  - Message passing
  - Data parallel
  - Clusters of SMPs
  - Grid
• A parallel machine may or may not be tightly coupled to a programming model:
  - Historically, tight coupling.
  - Today, portability is important.
• Trends in real machines.

A generic parallel architecture
Processors and memories are connected by an interconnection network.
• Where is the memory physically located?
• Is it connected directly to the processors?
• What is the connectivity of the network?

Parallel Programming Models
A programming model is made up of the languages and libraries that create an abstract view of the machine.
• Control: How is parallelism created? What orderings exist between operations?
• Data: What data is private vs. shared? How is logically shared data accessed or communicated?
• Synchronization: What operations can be used to coordinate parallelism? What are the atomic (indivisible) operations?
• Cost: How do we account for the cost of each of the above?

Simple Example
Consider applying a function f to the elements of an array A and then computing its sum:

    s = Σ_{i=0}^{n-1} f(A[i])

That is, with A the array of all the data: fA = f(A), then s = sum(fA).
Questions:
• Where does A live? All in a single memory? Partitioned?
• What work will be done by each processor?
• The processors need to coordinate to get a single result; how?

Programming Model 1: Shared Memory
• A program is a collection of threads of control.
  - Threads can be created dynamically, mid-execution, in some languages.
• Each thread has a set of private variables, e.g., local stack variables.
• Threads also have a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate by synchronizing on shared variables.

Simple Example, continued (shared memory)
• Shared memory strategy:
  - a small number of processors, p << n = size(A),
  - attached to a single memory.
• Parallel decomposition:
  - Each function evaluation and each partial sum is a task.
  - Assign n/p numbers to each of the p processors.
  - Each computes independent "private" results and a partial sum.
  - Collect the p partial sums and compute a global sum.
• Two classes of data:
  - Logically shared: the original n numbers and the global sum.
  - Logically private: the individual function evaluations.
  - What about the individual partial sums?

Shared Memory "Code" for Computing a Sum

    static int s = 0;

    Thread 1                     Thread 2
      for i = 0, n/2-1             for i = n/2, n-1
        s = s + f(A[i])              s = s + f(A[i])

• The problem is a race condition on the variable s.
• A race condition (or data race) occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write;
  - the accesses are concurrent (not synchronized), so they could happen simultaneously.

Shared Memory "Code" for Computing a Sum, at the instruction level

    static int s = 0;

    Thread 1                               Thread 2
      compute f(A[i]) and put in reg0        compute f(A[i]) and put in reg0
      reg1 = s                               reg1 = s
      reg1 = reg1 + reg0                     reg1 = reg1 + reg0
      s = reg1                               s = reg1

• Assume A = [3, 5], f(x) = x^2, and s = 0 initially.
• For this program to work, s should be 3^2 + 5^2 = 34 at the end, but it may end up as 34, 9, or 25.
• The atomic operations are the reads and writes: you never see half of one number, but the += operation is not atomic.
• All computation happens in (private) registers.

Improved Code for Computing a Sum

    static int s = 0;
    static lock lk;

    Thread 1                               Thread 2
      local_s1 = 0                           local_s2 = 0
      for i = 0, n/2-1                       for i = n/2, n-1
        local_s1 = local_s1 + f(A[i])          local_s2 = local_s2 + f(A[i])
      lock(lk);                              lock(lk);
      s = s + local_s1                       s = s + local_s2
      unlock(lk);                            unlock(lk);

• Since addition is associative, it is OK to rearrange the order of the additions.
• Most computation is now on private variables.
  - The sharing frequency is also reduced, which might improve speed.
  - But there is still a race condition on the update of the shared s.
  - The race condition can be fixed by adding locks: only one thread can hold a lock at a time; the others wait for it.
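The three possible outcomes (34, 9, or 25) can be checked by enumerating every legal interleaving of the two threads' read and write steps on s. A small Python sketch of that enumeration (Python standing in for the slides' register-level pseudocode; the function name is illustrative):

```python
from itertools import permutations

A = [3, 5]
f = lambda x: x * x
vals = [f(a) for a in A]          # thread 0 adds 9, thread 1 adds 25

def race_outcomes():
    """Enumerate all interleavings of each thread's (read s, write s) pair."""
    # Step indices: 0 = T0 reads s, 1 = T0 writes s,
    #               2 = T1 reads s, 3 = T1 writes s
    results = set()
    for order in permutations(range(4)):
        # Keep only orders where each thread reads before it writes.
        if order.index(0) > order.index(1) or order.index(2) > order.index(3):
            continue
        s = 0
        reg = [0, 0]              # each thread's private register copy of s
        for step in order:
            thread = step // 2
            if step % 2 == 0:     # read:  reg1 = s
                reg[thread] = s
            else:                 # write: s = (value read) + f(A[i])
                s = reg[thread] + vals[thread]
        results.add(s)
    return results
```

race_outcomes() returns {9, 25, 34}: the correct sum plus the two "lost update" results, where one thread overwrites the other's contribution because both read s before either wrote it back.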
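The improved version maps directly onto real threads. A minimal sketch using Python's threading module (the slides use C-like pseudocode; the function and variable names here are illustrative): each thread accumulates a private partial sum and takes the lock only for the single update of the shared total.

```python
import threading

def parallel_sum(A, f, p=2):
    """Compute sum(f(x) for x in A) with p threads and one lock."""
    s = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal s
        local = 0                 # private partial sum: no sharing inside the loop
        for x in chunk:
            local += f(x)
        with lock:                # one short critical section per thread
            s += local

    n = len(A)
    threads = [threading.Thread(target=worker,
                                args=(A[t * n // p:(t + 1) * n // p],))
               for t in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return s
```

For the slide's example, parallel_sum([3, 5], lambda x: x * x) returns 34 regardless of how the threads interleave, because the only shared write is serialized by the lock.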

