CS 61C Great Ideas in Computer Architecture Machine Structures Thread Level Parallelism Instructors Randy H Katz David A Patterson http inst eecs Berkeley edu cs61c sp11 01 14 2019 Spring 2011 Lecture 15 1 01 14 2019 Spring 2011 Lecture 15 2 You Are Here Software Parallel Requests Assigned to computer e g Search Katz Parallel Threads Assigned to core e g Lookup Ads Hardware Harness Parallelism Achieve High Performance Warehouse Scale Computer Smart Phone Computer Parallel Instructions Memory 1 instruction one time e g 5 pipelined instructions Parallel Data 1 data item one time e g Add of 4 pairs of words Hardware descriptions All gates functioning in parallel at same time 01 14 2019 Core Cache Input Output Today s Lecture Core Instruction Unit s Project 3 Core Functional Unit s A0 B0 A1 B1 A2 B2 A3 B3 Main Memory Logic Gates Spring 2011 Lecture 15 3 Agenda Multiprocessor Systems Administrivia Multiprocessor Cache Consistency Synchronization Technology Break OpenMP Introduction Summary 01 14 2019 Spring 2011 Lecture 15 4 Agenda Multiprocessor Systems Administrivia Multiprocessor Cache Consistency Synchronization Technology Break OpenMP Introduction Summary 01 14 2019 Spring 2011 Lecture 15 5 Parallel Processing Multiprocessor Systems MIMD Multiprocessor MIMD a computer system with at least 2 processors Processor Processor Processor Cache Cache Cache Interconnection Network Memory I O 1 Deliver high throughput for independent jobs via request level or task level parallelism 2 Improve the run time of a single program that has been specially crafted to run on a multiprocessor a parallel processing program Now Use term core for processor Multicore because Multiprocessor Microprocessor is redundant 01 14 2019 Spring 2011 Lecture 15 6 Transition to Multicore Sequential App Performance 01 14 2019 Spring 2011 Lecture 15 7 Multiprocessors and You Only path to performance is parallelism Clock rates flat or declining SIMD 2X width every 3 4 years 128b wide now 256b 2011 512b in 2014 1024b in 2018 MIMD Add 2 cores every 2 years 2 4 6 8 10 Key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increase i e that scale Scheduling load balancing time for synchronization overhead for communication Project 3 fastest matrix multiply code on 8 processor 8 cores computers 2 chips or sockets computer 4 cores chip 01 14 2019 Spring 2011 Lecture 15 8 Potential Parallel Performance Assuming SW can use it Year 2003 2005 2007 2009 2011 2013 2015 2017 2019 2021 01 14 2019 Cores MIMD 2 2 4 2yrs 6 8 10 12 2 5X14 16 18 20 SIMD bits Core 128 2X 128 4yrs 128 128 256 256 8X 512 512 1024 1024 SIMD Spring 2011 Lecture 15 Core SIMD bits Peak DP GFLOPs 256 MIMD 4 512 SIMD 8 768 12 1024 16 2560 40 3072 48 7168 20X112 8192 128 18432 288 20480 320 9 Three Key Questions about Multiprocessors Q1 How do they share data Q2 How do they coordinate Q3 How many processors can be supported 01 14 2019 Spring 2011 Lecture 15 10 Three Key Questions about Multiprocessors Q1 How do they share data Single address space shared by all processors cores 01 14 2019 Spring 2011 Lecture 15 11 Three Key Questions about Multiprocessors Q2 How do they coordinate Processors coordinate communicate through shared variables in memory via loads and stores Use of shared data must be coordinated via synchronization primitives locks that allow access to data to only one processor at a time All multicore computers today are Shared Memory Multiprocessors SMPs 01 14 2019 Spring 2011 Lecture 15 12 Example Sum Reduction Sum 100 000 numbers on 100 processor SMP Each processor has ID 0 Pn 99 Partition 1000 numbers per processor Initial summation on each processor sum Pn 0 for i 1000 Pn i 1000 Pn 1 i i 1 sum Pn sum Pn A i Now need to add these partial sums Reduction divide and conquer Half the processors add pairs then quarter Need to synchronize between reduction steps 01 14 2019 Spring 2011 Lecture 15 13 Example Sum Reduction This code executes simultaneously in P0 P1 P7 half 8 repeat synch if half 2 0 Pn 0 sum 0 sum 0 sum half 1 Conditional sum needed when half is odd Processor0 gets extra element half half 2 dividing line on who sums if Pn half sum Pn sum Pn sum Pn half until half 1 01 14 2019 Spring 2011 Lecture 15 14 Student Roulette An Example with 10 Processors sum P0 sum P1 sum P2 sum P3 sum P4 sum P5 sum P6 sum P7 sum P8 sum P9 P0 01 14 2019 P1 P2 P3 P4 P5 P6 Spring 2011 Lecture 15 P7 P8 P9 half 10 15 An Example with 10 Processors sum P0 sum P1 sum P2 sum P3 sum P4 sum P5 sum P6 sum P7 sum P8 sum P9 P0 P1 P2 P3 P4 P0 P1 P2 P3 P4 P0 P1 P5 P6 P8 P9 half 10 half 5 half 2 half 1 P0 01 14 2019 P7 Spring 2011 Lecture 15 16 Three Key Questions about Multiprocessors Q3 How many processors can be supported Key bottleneck in an SMP is the memory system Caches can effectively increase memory bandwidth open the bottleneck But what happens to the memory being actively shared among the processors through the caches 01 14 2019 Spring 2011 Lecture 15 17 Shared Memory and Caches What if Processors 1 and 2 read Memory 1000 value 20 Processor0 Cache Processor1 Processor2 1000 Cache 1000 1000 Cache 1000 Interconnection Network Memory 01 14 2019 2020 I O Spring 2011 Lecture 15 18 Student Roulette Shared Memory and Caches What if Processors 1 and 2 read Memory 1000 Processor 0 writes Memory 1000 with 40 1000 Processor0 Processor1 Processor2 1000 Cache40 Cache20 1000 Cache 1000 20 Processor 0 Write Invalidates Other Copies Interconnection Network Memory 1000 40 01 14 2019 I O Spring 2011 Lecture 15 19 Student Roulette Agenda Multiprocessor Systems Administrivia Multiprocessor Cache Consistency Synchronization Technology Break OpenMP Introduction Summary 01 14 2019 Spring 2011 Lecture 15 20 Course Organization Grading Participation and Altruism 5 Homework 5 4 of 6 HWs Completed Labs 20 7 of 12 Labs Completed Projects 40 1 Data Parallelism Map Reduce on Amazon EC2 2 Computer Instruction Set Simulator C 3 Performance Tuning of a Parallel Application Matrix Multiply using cache blocking SIMD MIMD OpenMP due with partner 4 Computer Processor Design Logisim Extra Credit Matrix Multiply Competition anything goes Midterm 10 6 9 PM Tuesday March 8 Final 20 11 30 2 30 PM Monday May 9 01 14 2019 Spring 2011 Lecture 15 21 Midterm Results 20 Students with this score 18 16 14 12 10 8 6 4 2 0 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 Score 01 14 2019 Spring 2011 Lecture 15 22 EECS Grading
View Full Document
Unlocking...