G22.2243-001 High Performance Computer Architecture
Lecture 10: Multiprocessing
March 29, 2006

Slide 2: Outline
• Announcements
  – Lab Assignment 3 due back today; extended to Monday 10am.
  – Homework Assignment 3 out today. Due in one week: April 5.
  – Lab Assignment 4 out today. Due in two weeks: April 12.
• Multithreading and multiprocessors
  – Introduction
  – Small-scale multiprocessors (symmetric shared-memory architectures)
  [Hennessy/Patterson CA:AQA (3rd Edition): Chapter 6]

Slide 3: Parallel Processing
• So far: focused on the performance of a single instruction stream
  – ILP exploits parallelism among the instructions of this stream
  – Needs to resolve control, data, and memory dependencies
• How do we get further improvements in performance?
  – Exploit parallelism among multiple instruction streams
  – Multithreading: streams run on one CPU
    • Typically share resources such as functional units, caches, etc.
    • Per-thread register set
  – Multiprocessing: streams run on multiple CPUs
    • Each CPU can itself be multithreaded
  – Common issues:
    • Synchronization between threads (see the sketch below)
    • Consistency of data in caches (more generally, communication)
• NYU course: G22.3033 Architecture and Programming of Parallel Computers
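The synchronization issue listed above can be made concrete with a small example. The following is a minimal sketch, not taken from the slides: two threads (instruction streams) on one machine increment a shared counter, with a POSIX mutex providing the synchronization between threads that the slide mentions. The file name and loop counts are illustrative only.

```c
/* Minimal sketch (not from the lecture): two instruction streams sharing a
 * counter, with a mutex providing synchronization between the threads.
 * Compile with: gcc threads.c -pthread   (file name is illustrative) */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                        /* data shared by both threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);              /* synchronize the two streams */
        counter++;                              /* update the shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);         /* 2000000 with the lock held */
    return 0;
}
```

Without the mutex the two streams race on the shared counter; keeping cached copies of such shared data consistent across processors is exactly the problem the later cache-coherence material addresses in hardware.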
Slide 4: Parallel Computers
• Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989)
• Questions about parallel computers:
  – How large a collection?
  – How powerful are the processing elements?
  – How do they cooperate and communicate?
  – How are data transmitted?
  – What type of interconnection?
  – What are the HW and SW primitives for the programmer?
  – Does it translate into performance?

Slide 5: What Level of Parallelism?
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – VLIW
  – Out-of-order execution
  – Limits to the benefits of ILP?
• Process-level or thread-level parallelism; mainstream for general-purpose computing?
  – Servers
  – High-end desktop dual-processor PCs

Slide 6: Why Multiprocessors?
1. Microprocessors are the fastest CPUs
   • Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
   • Do we have enough ideas to sustain 1.5x/yr?
   • Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops
   • Embedded: functional parallelism, producer/consumer model
   • Server figure of merit is tasks per hour vs. latency

Slide 7: TOP500 Supercomputers (top500.org)
• List of the top 500 supercomputers, published twice a year
• The latest list shows a major shake-up of the TOP10 since the last report
• Only six of the TOP10 systems from November 2004 are still large enough to hold on to a TOP10 position; four new systems entered the top tier
• No. 1 supercomputer: DOE's IBM BlueGene/L system
  – Installed at Lawrence Livermore National Laboratory (LLNL)
  – Achieves a record Linpack performance of 280.6 TFlop/s
  – Still the only system ever to exceed the 100 TFlop/s mark
  – 131,072 processors

Slide 8: TOP500 Architectures and Applications [charts only]

Slide 9: TOP500 Processors [charts only]

Slide 10: TOP500 OS and Interconnects [charts only]

Slide 11: Popular Flynn Categories
• SISD (Single Instruction, Single Data)
  – Uniprocessors
• MISD (Multiple Instruction, Single Data)
  – ???; multiple processors on a single data stream
• SIMD (Single Instruction, Multiple Data)
  – Examples: Illiac-IV, CM-2
    • Simple programming model
    • Low overhead
    • Flexibility
    • All custom integrated circuits
  – (Phrase reused by Intel marketing for media instructions ~ vector)
• MIMD (Multiple Instruction, Multiple Data)
  – Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    • Flexible
    • Use off-the-shelf micros

Slide 12: Two Major MIMD Styles
1. Centralized shared memory
   • UMA: Uniform Memory Access
   • Symmetric (shared-memory) multiprocessors (SMPs)
   [Figure: several processors, each with its own caches, sharing one main memory and I/O system]

Slide 13: Two Major MIMD Styles (continued)
2. Decentralized memory (a memory module with each CPU)
   • Gets more memory bandwidth and lower local memory latency
   • Drawback: longer communication latency
   • Drawback: more complex software model
   • Two major communication models
   [Figure: nodes, each with a processor, caches, memory, and I/O, connected by an interconnection network]

Slide 14: Communication Models for Decentralized-Memory Versions
1. Shared address space
   • Called distributed shared memory (DSM)
   • Shared → shared address space
   • Shared memory with "Non-Uniform Memory Access" time (NUMA)
2. Multiple private address spaces
   • Message-passing "multicomputer" with a separate address space per processor
   • Can invoke software with a Remote Procedure Call (RPC)
   • Often via a library, such as MPI: Message Passing Interface (see the MPI sketch after the slides)
   • Also called "synchronous communication" since communication causes synchronization between two processes
   • Asynchronous communication for higher performance

Slide 15: Communication Performance Metrics: Latency and Bandwidth
1. Bandwidth
   – Need high bandwidth in communication
   – Match limits in network, memory, and processor
   – Node bandwidth vs. bisection bandwidth of the network
2. Latency
   – Affects performance, since the processor may have to wait
   – Affects ease of programming, since it takes more thought to overlap communication and computation
   – Overhead to communicate is a problem in many machines
3. Latency hiding
   – How can a mechanism help hide latency?
   – Increases the burden on the programming system
   – Examples: overlap message send with computation, prefetch data, switch to other tasks (see the nonblocking-MPI sketch after the slides)

Slide 16: Parallel Framework
• Layers:
  – Programming model:
    • Multiprogramming: lots of jobs, no communication
    • Shared address space: communicate via memory
    • Message passing: send and receive messages
    • Data parallel: several agents operate on several data sets simultaneously, then exchange information globally and simultaneously (shared or message passing)
  – Communication abstraction:
    • Shared address space: e.g., load, store, atomic swap (see the atomic-swap sketch after the slides)
    • Message passing: e.g., send, receive library calls
    • Debate over this topic (ease of programming, scaling) => many hardware designs are 1:1 with a programming model

Slide 17: [Figure only: processors P0, P1, ..., Pn issuing loads and stores to a shared portion of the address space, with per-processor private portions; the preview ends here]
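Slide 14 names MPI as the typical library for the message-passing model. Below is a minimal sketch, not from the slides, of the send/receive abstraction: rank 0 sends one integer to rank 1, each rank working in its own private address space. The file name and build line are illustrative.

```c
/* Minimal MPI sketch (not from the lecture) of the send/receive model on
 * Slide 14: rank 0 sends a value to rank 1; each rank has a private address
 * space. Typical build/run: mpicc msg.c -o msg && mpirun -np 2 ./msg */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* blocking send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* blocking receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

The blocking receive completes only after the message arrives, so the exchange also synchronizes the two processes, which is the sense in which the slide calls message passing "synchronous communication."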

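Slide 15 lists overlapping a message send with computation as one way to hide communication latency. The sketch below, not from the slides and assuming exactly two MPI ranks, starts a nonblocking transfer with MPI_Isend/MPI_Irecv, does independent computation while the message is in flight, and only then waits for completion.

```c
/* Minimal sketch (not from the lecture) of latency hiding from Slide 15:
 * start a nonblocking transfer, compute while it is in flight, then wait.
 * Assumes exactly two MPI ranks (illustrative only). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[1024] = {0};
    double sum = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    /* Independent work overlapped with the communication latency. */
    for (int i = 0; i < 1000000; i++)
        sum += i * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* the transfer must be complete past here */
    printf("rank %d: overlap finished, sum = %f\n", rank, sum);

    MPI_Finalize();
    return 0;
}
```

The same idea appears in hardware as prefetching and in the operating system as switching to another ready task while a long-latency operation completes.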

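Slide 16 gives load, store, and atomic swap as the communication abstraction for the shared-address-space model. The sketch below, not from the slides, uses C11's atomic_exchange as the atomic swap to build a test-and-set spinlock that guards ordinary loads and stores to shared data.

```c
/* Minimal sketch (not from the lecture) of the "load, store, atomic swap"
 * abstraction on Slide 16: an atomic swap (exchange) implements a
 * test-and-set spinlock around ordinary accesses to shared memory. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;       /* 0 = free, 1 = held */
static int shared_value = 0;      /* plain shared data, accessed by load/store */

static void acquire(void)
{
    /* Atomically swap in 1; if the old value was 1, another thread holds it. */
    while (atomic_exchange(&lock, 1) == 1)
        ;                         /* spin until the swap returns 0 */
}

static void release(void)
{
    atomic_store(&lock, 0);       /* store 0 to release the lock */
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        acquire();
        shared_value++;           /* protected load + store */
        release();
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_value = %d\n", shared_value);   /* 200000 */
    return 0;
}
```

On a real SMP the spinning loads hit in the local cache, and the cache-coherence protocol covered in the symmetric shared-memory material is what keeps the lock variable and shared_value consistent across the processors' caches.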