3 11 11 CS 61C Great Ideas in Computer Architecture Machine Structures Thread Level Parallelism Instructors Randy H Katz David A PaFerson hFp inst eecs Berkeley edu cs61c sp11 3 10 11 Spring 2011 Lecture 15 1 3 10 11 Spring 2011 Lecture 15 You Are Here Agenda So1ware Hardware Parallel Requests Assigned to computer e g Search Katz Parallel Threads Assigned to core e g Lookup Ads Smart Phone Warehouse Scale Computer Harness Parallelism Achieve High Performance Computer Parallel InstrucYons 1 instrucYon one Yme e g 5 pipelined instrucYons Parallel Data 1 data item one Yme e g Add of 4 pairs of words Hardware descripYons All gates funcYoning in parallel at same Yme 3 10 11 Core Core Memory Cache Input Output Today s InstrucYon Unit s Lecture Project 3 Core FuncYonal Unit s A0 B0 A1 B1 A2 B2 A3 B3 Main Memory MulYprocessor Systems Administrivia MulYprocessor Cache Consistency SynchronizaYon Technology Break OpenMP IntroducYon Summary Logic Gates Spring 2011 Lecture 15 3 3 10 11 Spring 2011 Lecture 15 4 MulYprocessor MIMD a computer system with at least 2 processors MulYprocessor Systems Administrivia MulYprocessor Cache Consistency SynchronizaYon Technology Break OpenMP IntroducYon Summary 3 10 11 Spring 2011 Lecture 15 Parallel Processing MulYprocessor Systems MIMD Agenda 2 Processor Processor Processor Cache Cache Cache Interconnec3on Network Memory I O 1 Deliver high throughput for independent jobs via request level or task level parallelism 2 Improve the run me of a single program that has been specially cra9ed to run on a mul processor a parallel processing program Now Use term core for processor Mul3core because Mul3processor Microprocessor is redundant 5 3 10 11 Spring 2011 Lecture 15 6 1 3 11 11 TransiYon to MulYcore MulYprocessors and You Only path to performance is parallelism Clock rates at or declining SIMD 2X width every 3 4 years 128b wide now 256b 2011 512b in 2014 1024b in 2018 Sequential App Performance MIMD Add 2 cores every 2 years 2 4 6 8 10 Key challenge is to cram parallel programs that have high performance on mulYprocessors as the number of processors increase i e that scale Scheduling load balancing Yme for synchronizaYon overhead for communicaYon Project 3 fastest matrix mulYply code on 8 processor 8 cores computers 2 chips or sockets computer 4 cores chip 3 10 11 Spring 2011 Lecture 15 7 3 10 11 PotenYal Parallel Performance Assuming SW can use it Year Cores 2003 2005 2007 2009 2011 2013 2015 2017 2019 2021 3 10 11 MIMD 2 2 4 2yrs 6 8 10 12 2 5X 14 16 18 20 SIMD bits Core SIMD Core SIMD bits 8 Three Key QuesYons about MulYprocessors Peak DP GFLOPs 128 128 128 128 256 256 8X 512 512 1024 1024 256 MIMD 4 512 SIMD 8 768 12 1024 16 2560 40 3072 48 7168 20X 112 8192 128 18432 288 20480 320 Spring 2011 Lecture 15 9 2X 4yrs Spring 2011 Lecture 15 Q1 How do they share data Q2 How do they coordinate Q3 How many processors can be supported 3 10 11 Three Key QuesYons about MulYprocessors Spring 2011 Lecture 15 10 Three Key QuesYons about MulYprocessors Q1 How do they share data Single address space shared by all processors cores Q2 How do they coordinate Processors coordinate communicate through shared variables in memory via loads and stores Use of shared data must be coordinated via synchronizaYon primiYves locks that allow access to data to only one processor at a Yme All mulYcore computers today are Shared Memory MulYprocessors SMPs 3 10 11 Spring 2011 Lecture 15 11 3 10 11 Spring 2011 Lecture 15 12 2 3 11 11 Example Sum ReducYon Example Sum ReducYon Sum 100 000 numbers on 100 processor SMP This code executes simultaneously in P0 P1 P7 Each processor has ID 0 Pn 99 ParYYon 1000 numbers per processor IniYal summaYon on each processor half 8 repeat synch if half 2 0 Pn 0 sum 0 sum 0 sum half 1 Conditional sum needed when half is odd Processor0 gets extra element half half 2 dividing line on who sums if Pn half sum Pn sum Pn sum Pn half until half 1 sum Pn 0 for i 1000 Pn i 1000 Pn 1 i i 1 sum Pn sum Pn A i Now need to add these parYal sums Reduc on divide and conquer Half the processors add pairs then quarter Need to synchronize between reducYon steps 3 10 11 Spring 2011 Lecture 15 13 3 10 11 An Example with 10 Processors P1 P2 P3 P4 P5 P6 P7 P8 P9 Student RouleFe sum P0 sum P1 sum P2 sum P3 sum P4 sum P5 sum P6 sum P7 sum P8 sum P9 half 10 P0 P1 P2 P3 P4 P0 P1 P2 P3 P4 P0 P1 P5 P6 P7 P8 P9 Spring 2011 Lecture 15 15 Three Key QuesYons about MulYprocessors half 2 half 1 Spring 2011 Lecture 15 3 10 11 Spring 2011 Lecture 15 16 Shared Memory and Caches Q3 How many processors can be supported Key boFleneck in an SMP is the memory system Caches can e ecYvely increase memory bandwidth open the boFleneck But what happens to the memory being acYvely shared among the processors through the caches 3 10 11 half 10 half 5 P0 3 10 11 14 An Example with 10 Processors sum P0 sum P1 sum P2 sum P3 sum P4 sum P5 sum P6 sum P7 sum P8 sum P9 P0 Spring 2011 Lecture 15 17 What if Processors 1 and 2 read Memory 1000 value 20 Processor 0 Cache Processor 1 Processor 2 1000 Cache 1000 1000 Cache 1000 Interconnec3on Network Memory 20 20 3 10 11 Spring 2011 Lecture 15 I O Student RouleFe 18 3 3 11 11 Shared Memory and Caches Agenda What if Processors 1 and 2 read Memory 1000 Processor 0 writes Memory 1000 with 40 1000 Processor 0 Processor 1 Processor 2 1000 Cache 40 Cache 20 1000 Cache 1000 20 Processor 0 Write Invalidates Other Copies Interconnec3on Network Memory 1000 40 3 10 11 MulYprocessor Systems Administrivia MulYprocessor Cache Consistency SynchronizaYon Technology Break OpenMP IntroducYon Summary I O Spring 2011 Lecture 15 Student RouleFe 19 3 10 11 Spring 2011 Lecture 15 Course OrganizaYon Midterm Results Grading 20 ParYcipaYon and Altruism 5 Homework 5 4 of 6 HWs Completed Labs 20 7 of 12 Labs Completed Projects 40 18 1 Data Parallelism Map Reduce on Amazon EC2 2 Computer Instruc on Set Simulator C 3 Performance Tuning of a Parallel ApplicaYon Matrix MulYply using cache blocking SIMD MIMD OpenMP due with partner 4 Computer Processor Design Logisim Extra Credit Matrix MulYply CompeYYon anything goes Midterm 10 6 9 PM Tuesday March 8 Final 20 11 30 2 30 PM Monday May 9 3 10 11 Spring 2011 Lecture 15 Students with this score 16 14 12 10 8 6 4 2 0 14 21 3 10 11 Midterm Results A 74 89 Students with this score B 50 73 18 C score 14 49 16 14 12 10 8 6 4 2 0 14 3 10 11 19 24 29 34 …
View Full Document
Unlocking...