UCLA COMSCI M151B - lec10-c7

Chapter 7: Multicores, Multiprocessors, and Clusters

Introduction (§7.1)
- Goal: connect multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)

Parallel Programming (§7.2 The Difficulty of Creating Parallel Processing Programs)
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
- Need sequential part to be 0.1% of original time
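
The algebra above is easy to check numerically. The following small C program is an illustrative sketch, not part of the lecture (the function name amdahl_speedup is mine): it evaluates the speedup formula for a given parallelizable fraction and processor count, showing that F_parallelizable = 0.999 yields roughly 90× on 100 processors, while even 1% sequential work caps the speedup near 50×.

    #include <stdio.h>

    /* Amdahl's Law: speedup with n processors when a fraction f of the
       original execution time is parallelizable (illustrative sketch). */
    static double amdahl_speedup(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        /* Slide example: 100 processors, want ~90x speedup -> need f = 0.999 */
        printf("f = 0.999, n = 100: speedup = %.1f\n", amdahl_speedup(0.999, 100));
        /* With even 1% sequential work the achievable speedup drops sharply */
        printf("f = 0.990, n = 100: speedup = %.1f\n", amdahl_speedup(0.990, 100));
        return 0;
    }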
Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

Scaling Example (cont)
- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced

Strong vs Weak Scaling
- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × t_add
  - 100 processors, 32 × 32 matrix
    - Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example
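
The same kind of numerical check works for the scaling example. This sketch (the helper name model_time is mine, not from the slides) evaluates the execution-time model used above, time = sequential adds + matrix adds / p, in units of t_add, and reproduces the speedups quoted on the two scaling slides.

    #include <stdio.h>

    /* Execution-time model from the scaling example: the 10 scalar adds run
       sequentially, the matrix adds are divided evenly among p processors.
       Times are in units of t_add (illustrative sketch). */
    static double model_time(int seq_adds, int matrix_adds, int p) {
        return seq_adds + (double)matrix_adds / p;
    }

    int main(void) {
        int procs[] = {1, 10, 100};
        for (int i = 0; i < 3; i++) {
            int p = procs[i];
            double t10  = model_time(10, 100, p);    /* 10 x 10 matrix   */
            double t100 = model_time(10, 10000, p);  /* 100 x 100 matrix */
            printf("p = %3d: 10x10 speedup = %5.1f, 100x100 speedup = %5.1f\n",
                   p, 110.0 / t10, 10010.0 / t100);
        }
        return 0;
    }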


Shared Memory (§7.3 Shared Memory Multiprocessors)
- SMP: shared memory multiprocessor
  - Hardware provides single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor UMA machine
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition: 1000 numbers per processor
  - Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, …
  - Need to synchronize between reduction steps

Example: Sum Reduction (cont)

    half = 100;
    repeat
      synch();
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor 0 gets missing element */
      half = half/2;  /* dividing line on who sums */
      if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
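
The pseudocode above can be turned into a runnable shared-memory program. Below is a sketch using POSIX threads, which the slides do not specify: pthread_barrier_wait stands in for synch(), the sizes mirror the example (100 threads, 1000 numbers each), and the variable names follow the slide where possible; everything else is my own illustrative choice.

    /* compile with: cc -pthread sum_reduce.c */
    #include <pthread.h>
    #include <stdio.h>

    #define P 100               /* number of "processors" (threads)        */
    #define N 100000            /* numbers to sum, 1000 per thread         */

    static double A[N];
    static double sum[P];       /* per-thread partial sums                 */
    static pthread_barrier_t bar;  /* plays the role of synch() on the slide */

    static void *worker(void *arg) {
        int Pn = (int)(long)arg;

        /* Phase 1: each thread sums its own 1000-element slice. */
        sum[Pn] = 0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction, halving the active threads each step. */
        int half = P;
        do {
            pthread_barrier_wait(&bar);      /* synch() between steps */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];     /* odd count: thread 0 adds the stray element */
            half = half / 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
        return NULL;
    }

    int main(void) {
        pthread_t tid[P];
        for (int i = 0; i < N; i++) A[i] = 1.0;   /* expected total: 100000 */

        pthread_barrier_init(&bar, NULL, P);
        for (long p = 0; p < P; p++)
            pthread_create(&tid[p], NULL, worker, (void *)p);
        for (int p = 0; p < P; p++)
            pthread_join(tid[p], NULL);
        pthread_barrier_destroy(&bar);

        printf("total = %.0f\n", sum[0]);         /* result ends up in sum[0] */
        return 0;
    }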

Message Passing (§7.4 Clusters and Other Message-Passing Multiprocessors)
- Each processor has private physical address space
- Hardware sends/receives messages between processors

Loosely Coupled Clusters
- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - c.f. processor/memory bandwidth on an SMP

Sum Reduction (Again)
- Sum 100,000 numbers on 100 processors
- First distribute 1000 numbers to each
- Then do partial sums:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];

- Reduction
  - Half the processors send, other half receive and add
  - Then a quarter send, a quarter receive and add, …
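
The slides do not name a message-passing library, but MPI is the standard way to write this send/receive reduction, so the sketch below assumes it. For brevity it also assumes the number of ranks is a power of two (the slide's odd-count handling is omitted); in practice a single MPI_Reduce call does the same job.

    /* compile with: mpicc reduce_mpi.c   run with: mpirun -np 8 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Local data: 1000 numbers per rank (all 1.0 here for easy checking). */
        double AN[1000], sum = 0.0;
        for (int i = 0; i < 1000; i++) AN[i] = 1.0;
        for (int i = 0; i < 1000; i++) sum = sum + AN[i];

        /* Tree reduction: at each step the upper half of the active ranks
           sends its sum to the lower half, which receives and adds.
           Assumes size is a power of two. */
        for (int half = size / 2; half >= 1; half = half / 2) {
            if (rank >= half && rank < 2 * half) {
                MPI_Send(&sum, 1, MPI_DOUBLE, rank - half, 0, MPI_COMM_WORLD);
            } else if (rank < half) {
                double partial;
                MPI_Recv(&partial, 1, MPI_DOUBLE, rank + half, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum = sum + partial;
            }
        }

        if (rank == 0) printf("total = %.0f\n", sum);  /* size * 1000 */
        MPI_Finalize();
        return 0;
    }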

Grid Computing
- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

Multithreading (§7.5 Hardware Multithreading)
- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Simultaneous Multithreading
- In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Multithreading Example
- (figure-only slide; the figure is not included in this text preview)

Future of Multithreading
- Will it survive? In what form?
- Power considerations → simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
  - Thread … (text cut off in this preview)