Slide 1: G22.2243-001 High Performance Computer Architecture
Lecture 9: Multiprocessing
October 31, 2007

Slide 2: Outline
• Announcements
  – Lab Assignment 2 due back today
  – Homework Assignment 3 out today; due in one week: November 7
  – Lab Assignment 3 out today; due in two weeks: November 14
• Multithreading and multiprocessors
  – Introduction
  – Small-scale multiprocessors (symmetric shared-memory architectures)
  [Hennessy/Patterson CA:AQA (4th Edition): Chapter 4]

Slide 3: Parallel Processing
• So far: focused on performance of a single instruction stream
  – ILP exploits parallelism among the instructions of this stream
  – Needs to resolve control, data, and memory dependencies
• How do we get further improvements in performance?
  – Exploit parallelism among multiple instruction streams
  – Multithreading: streams run on one CPU
    • Typically share resources such as functional units, caches, etc.
    • Per-thread register set
  – Multiprocessing: streams run on multiple CPUs
    • Each CPU can itself be multithreaded
  – Common issues:
    • Synchronization between threads
    • Coherence and consistency of data in caches
  – NYU course: G22.3033 Architecture and Programming of Parallel Computers
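To make the multithreading/multiprocessing distinction above concrete, here is a minimal sketch (not from the slides) of thread-level parallelism using POSIX threads in C: several instruction streams run inside one process, share its address space, and synchronize only when their partial results are combined. The thread count, work partition, and function names are illustrative assumptions.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000

    static double partial[NTHREADS];   /* one result slot per thread, so no data races */

    /* Each thread is an independent instruction stream; all threads share the
       process address space (and, on one CPU, the caches and functional units). */
    static void *sum_part(void *arg) {
        long id = (long)arg;
        double s = 0.0;
        for (long i = id; i < N; i += NTHREADS)   /* interleaved partition of the work */
            s += 1.0 / (double)(i + 1);
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, sum_part, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);           /* synchronization point between threads */
            total += partial[t];
        }
        printf("harmonic sum = %f\n", total);
        return 0;
    }

Whether these streams time-share one CPU or run on separate CPUs, the common issues listed on Slide 3 (synchronization, coherence and consistency of cached data) are the same.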
Slide 4: Parallel Computers
• Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989)
• Questions about parallel computers:
  – How large a collection?
  – How powerful are the processing elements?
  – How do they cooperate and communicate?
  – How are data transmitted?
  – What type of interconnection?
  – What are the HW and SW primitives for the programmer?
  – Does it translate into performance?

Slide 5: What Level of Parallelism?
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – VLIW
  – Out-of-order execution
  – Limits to the benefits of ILP
• Process-level or thread-level parallelism; mainstream for general-purpose computing?
  – Servers
  – High-end desktop dual-processor PCs

Slide 6: Why Multiprocessors?
1. Microprocessors are the fastest CPUs
   • Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
   • Do we have enough ideas to sustain 1.5X/yr?
   • Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops
   • Embedded: functional parallelism, producer/consumer model
   • Server figure of merit is tasks per hour vs. latency

Slide 7: TOP500 Supercomputers (top500.org)
• List of the top 500 supercomputers, published twice a year
• The latest list shows a major shake-up of the TOP10 since the last report
• Only six of the TOP10 systems from November 2004 are still large enough to hold on to a TOP10 position; four new systems entered the top tier
• No. 1 supercomputer: DOE's IBM BlueGene/L system
  – Installed at Lawrence Livermore National Laboratory (LLNL)
  – Achieves a record Linpack performance of 280.6 TFlop/s
  – Still the only system ever to exceed the 100 TFlop/s mark
  – 131,072 processors

Slide 8: TOP500 Architectures and Applications
[Figure-only slide]

Slide 9: TOP500 Processors
[Figure-only slide]

Slide 10: Top500 OS and Interconnects
[Figure-only slide]

Slide 11: Popular Flynn Categories
• SISD (Single Instruction, Single Data)
  – Uniprocessors
• MISD (Multiple Instruction, Single Data)
  – ???; multiple processors on a single data stream
• SIMD (Single Instruction, Multiple Data)
  – Examples: Illiac-IV, CM-2
    • Simple programming model
    • Low overhead
    • Flexibility
    • All custom integrated circuits
  – (Phrase reused by Intel marketing for media instructions ~ vector)
• MIMD (Multiple Instruction, Multiple Data)
  – Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    • Flexible
    • Use off-the-shelf micros

Slide 12: Two Major MIMD Styles
1. Centralized shared memory
   • UMA: Uniform Memory Access
   • Symmetric (shared-memory) multiprocessors (SMPs)
   [Figure: several processors, each with its own caches, sharing a single main memory and I/O system]

Slide 13: Two Major MIMD Styles
2. Decentralized memory (memory module with CPU)
   • Get more memory bandwidth, lower memory latency
   • Drawback: longer communication latency
   • Drawback: software model more complex
   • Two major communication models
   [Figure: nodes, each containing a processor with caches plus local memory and I/O, connected by an interconnection network]

Slide 14: Communication Models for Decentralized-Memory Versions
1. Shared address space:
   • Called distributed shared memory (DSM)
   • Shared → shared address space
   • Shared memory with "Non-Uniform Memory Access" time (NUMA)
2. Multiple private address spaces:
   • Message-passing "multicomputer" with a separate address space per processor
   • Can invoke software with Remote Procedure Call (RPC)
   • Often via a library, such as MPI: Message Passing Interface (a minimal send/receive sketch follows Slide 17)
   • Also called "synchronous communication" since communication causes synchronization between the two processes
   • Asynchronous communication for higher performance

Slide 15: Communication Performance Metrics: Latency and Bandwidth
1. Bandwidth
   – Need high bandwidth in communication
   – Match limits in network, memory, and processor
   – Node bandwidth vs. bisection bandwidth of the network
2. Latency
   – Affects performance, since the processor may have to wait
   – Affects ease of programming, since it requires more thought to overlap communication and computation
   – Overhead to communicate is a problem in many machines
3. Latency hiding
   – How can a mechanism help hide latency?
   – Increases the programming system burden
   – Examples: overlap message send with computation, prefetch data, switch to other tasks

Slide 16: Parallel Framework
• Layers:
  – Programming model:
    • Multiprogramming: lots of jobs, no communication
    • Shared address space: communicate via memory
    • Message passing: send and receive messages
    • Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
  – Communication abstraction:
    • Shared address space: e.g., load, store, atomic swap
    • Message passing: e.g., send, receive library calls
    • Debate over this topic (ease of programming, scaling)

Slide 17: (untitled figure)
[Figure: processes P0, P1, P2, …, Pn issue loads and stores into a shared portion of the address space, while each process keeps its own private portion. Caption: "Virtual address spaces for …"]
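The send/receive communication abstraction on Slides 14 and 16 (explicit messages between private address spaces) can be illustrated with a minimal MPI sketch. This example is not from the lecture; the tag value and the data being exchanged are illustrative assumptions. Each rank owns private data, and the only way that data reaches another rank is through an explicit message, which also serves as the synchronization point the slides describe.

    /* Minimal MPI sketch: every rank holds private data; rank 0 collects one
     * value from each other rank via explicit send/receive messages.
     * Compile with mpicc, run with e.g. mpirun -np 4 ./a.out
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = (double)rank * (double)rank;   /* private data, illustrative */

        if (rank != 0) {
            /* Blocking send to rank 0 ("synchronous" in the slides' sense). */
            MPI_Send(&local, 1, MPI_DOUBLE, 0, /*tag=*/42, MPI_COMM_WORLD);
        } else {
            double total = local, incoming;
            for (int src = 1; src < nprocs; src++) {
                /* The matching receive is the synchronization point. */
                MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 42, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += incoming;
            }
            printf("sum of rank^2 over %d ranks = %f\n", nprocs, total);
        }

        MPI_Finalize();
        return 0;
    }

The latency-hiding techniques on Slide 15 map onto the non-blocking variants of these calls (MPI_Isend/MPI_Irecv followed by MPI_Wait), which let a process overlap computation with the message transfer.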

