Chapter 7: Multicores, Multiprocessors, and Clusters

§7.1 Introduction
- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)

§7.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
- Need sequential part to be 0.1% of original time

Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

Scaling Example (cont)
- What if matrix size is 100 × 100?
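Before moving to the larger matrix, the Amdahl's Law and 10 × 10 scaling arithmetic above can be checked with a short script. This is a sketch; the function and variable names (`speedup`, `time`, `F`) are illustrative, not from the slides:

```python
# Sketch: check the Amdahl's Law and 10x10 scaling arithmetic above.

def speedup(F, P):
    """Amdahl's Law: Speedup = 1 / ((1 - F) + F/P)."""
    return 1.0 / ((1.0 - F) + F / P)

# Amdahl example: what fraction F gives 90x speedup on 100 processors?
# Solving 90 = 1 / ((1 - F) + F/100) for F:
F = (1 - 1 / 90) / (1 - 1 / 100)
print(round(F, 3))  # 0.999 -> sequential part must be ~0.1% of original time

# Scaling example: 10 scalar adds (sequential) + 10x10 matrix sum (parallel).
def time(P, elems, seq=10):
    """Execution time in units of t_add with P processors."""
    return seq + elems / P

print(110 / time(10, 100))   # 10 processors: speedup 5.5
print(110 / time(100, 100))  # 100 processors: speedup 10.0
```

Note how the fixed 10 sequential adds come to dominate as the processor count grows, which is why the speedup falls ever further behind the processor count.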
- Single processor: Time = (10 + 10000) × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced

Strong vs Weak Scaling
- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × t_add
  - 100 processors, 32 × 32 matrix
    - Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example

§7.3 Shared Memory Multiprocessors

Shared Memory
- SMP: shared memory multiprocessor
  - Hardware provides single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor UMA
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition: 1000 numbers per processor
  - Initial summation on each processor:

```
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
```

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, …
  - Need to synchronize between reduction steps

```
half = 100;
repeat
    synch();                /* barrier: wait for all processors */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           processor 0 gets the missing element */
    half = half/2;          /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
```

§7.4 Clusters and Other Message-Passing Multiprocessors

Message Passing
- Each processor has private physical address space
- Hardware sends/receives messages between processors

Loosely Coupled Clusters
- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
  - High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - cf. processor/memory bandwidth on an SMP

Sum Reduction (Again)
- Sum 100,000 numbers on 100 processors
- First distribute 1000 numbers to each
- Then do partial sums:

```
sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];   /* AN: this node's 1000 numbers */
```

- Reduction
  - Half the processors send, other half receive and add
  - Then a quarter send, a quarter receive and add, …

Grid Computing
- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

Multithreading
- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

§7.5 Hardware Multithreading

Simultaneous Multithreading
- In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Multithreading Example
(figure slide; image not captured in this text extract)

Future of Multithreading
- Will it survive? In what form?
- Power considerations ⇒ simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
- Thread …
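To close out the chapter, the message-passing sum reduction described earlier ("half the processors send, the other half receive and add", repeated until one value remains) can be simulated in software. This is an illustrative sketch only; `reduce_sums`, its odd-count handling order, and the example data are my own, not code from the slides:

```python
# Sketch: simulate the message-passing reduction from "Sum Reduction (Again)".
# Each "processor" holds one partial sum; in each step the upper half of the
# live processors sends its value to the lower half, which receives and adds.

def reduce_sums(partial):
    """Fold a list of per-processor partial sums down to one total."""
    sums = list(partial)
    limit = len(sums)           # number of processors still holding a value
    half = limit
    while half > 1:
        half = half // 2
        for pn in range(half):            # receivers: processors 0..half-1
            sums[pn] += sums[pn + half]   # "receive" from pn + half and add
        if 2 * half < limit:              # odd count: processor 0 also takes
            sums[0] += sums[2 * half]     # the leftover element
        limit = half
    return sums[0]

# 100 processors, each having already summed 1000 of the 100,000 numbers
partials = [sum(range(pn * 1000, (pn + 1) * 1000)) for pn in range(100)]
print(reduce_sums(partials) == sum(range(100000)))  # True
```

On a real cluster each `sums[pn] += sums[pn + half]` would be a send on one node paired with a receive-and-add on another, so the reduction finishes in about log2(P) communication steps rather than P.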