Chapter 7: Multicores, Multiprocessors, and Clusters

§7.1 Introduction
- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)

§7.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
- Need sequential part to be 0.1% of original time

Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

Scaling Example (cont)
- What if matrix size is 100 × 100?
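Before moving to the larger matrix, the Amdahl's Law and 10 × 10 scaling arithmetic above can be checked with a short script. This is a sketch; the function and variable names (`speedup`, `time`, `F`) are illustrative, not from the slides:

```python
# Sketch: check the Amdahl's Law and 10x10 scaling arithmetic above.

def speedup(F, P):
    """Amdahl's Law: Speedup = 1 / ((1 - F) + F/P)."""
    return 1.0 / ((1.0 - F) + F / P)

# Amdahl example: what fraction F gives 90x speedup on 100 processors?
# Solving 90 = 1 / ((1 - F) + F/100) for F:
F = (1 - 1 / 90) / (1 - 1 / 100)
print(round(F, 3))  # 0.999 -> sequential part must be ~0.1% of original time

# Scaling example: 10 scalar adds (sequential) + 10x10 matrix sum (parallel).
def time(P, elems, seq=10):
    """Execution time in units of t_add with P processors."""
    return seq + elems / P

print(110 / time(10, 100))   # 10 processors: speedup 5.5
print(110 / time(100, 100))  # 100 processors: speedup 10.0
```

Note how the fixed 10 sequential adds come to dominate as the processor count grows, which is why the speedup falls ever further behind the processor count.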
- Single processor: Time = (10 + 10000) × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced

Strong vs Weak Scaling
- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × t_add
  - 100 processors, 32 × 32 matrix
    - Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example

§7.3 Shared Memory Multiprocessors

Shared Memory
- SMP: shared memory multiprocessor
  - Hardware provides single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor UMA
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition: 1000 numbers per processor
  - Initial summation on each processor:

```
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
```

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, …
  - Need to synchronize between reduction steps

```
half = 100;
repeat
    synch();                /* barrier: wait for all processors */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           processor 0 gets the missing element */
    half = half/2;          /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
```

§7.4 Clusters and Other Message-Passing Multiprocessors

Message Passing
- Each processor has private physical address space
- Hardware sends/receives messages between processors

Loosely Coupled Clusters
- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
  - High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - cf. processor/memory bandwidth on an SMP

Sum Reduction (Again)
- Sum 100,000 numbers on 100 processors
- First distribute 1000 numbers to each
- Then do partial sums:

```
sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];   /* AN: this node's 1000 numbers */
```

- Reduction
  - Half the processors send, other half receive and add
  - Then a quarter send, a quarter receive and add, …

Grid Computing
- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

Multithreading
- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

§7.5 Hardware Multithreading

Simultaneous Multithreading
- In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Multithreading Example
(figure slide; image not captured in this text extract)

Future of Multithreading
- Will it survive? In what form?
- Power considerations ⇒ simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
- Thread …
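To close out the chapter, the message-passing sum reduction described earlier ("half the processors send, the other half receive and add", repeated until one value remains) can be simulated in software. This is an illustrative sketch only; `reduce_sums`, its odd-count handling order, and the example data are my own, not code from the slides:

```python
# Sketch: simulate the message-passing reduction from "Sum Reduction (Again)".
# Each "processor" holds one partial sum; in each step the upper half of the
# live processors sends its value to the lower half, which receives and adds.

def reduce_sums(partial):
    """Fold a list of per-processor partial sums down to one total."""
    sums = list(partial)
    limit = len(sums)           # number of processors still holding a value
    half = limit
    while half > 1:
        half = half // 2
        for pn in range(half):            # receivers: processors 0..half-1
            sums[pn] += sums[pn + half]   # "receive" from pn + half and add
        if 2 * half < limit:              # odd count: processor 0 also takes
            sums[0] += sums[2 * half]     # the leftover element
        limit = half
    return sums[0]

# 100 processors, each having already summed 1000 of the 100,000 numbers
partials = [sum(range(pn * 1000, (pn + 1) * 1000)) for pn in range(100)]
print(reduce_sums(partials) == sum(range(100000)))  # True
```

On a real cluster each `sums[pn] += sums[pn + half]` would be a send on one node paired with a receive-and-add on another, so the reduction finishes in about log2(P) communication steps rather than P.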