Parallel Computers: Definition

- "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989)
- Also called multiprocessors, parallel computers, or parallel processors.
- Questions about parallel computers:
  - How large a collection?
  - How powerful are the processing elements?
  - How do they cooperate and communicate?
  - How are data transmitted? What type of interconnection?
  - What are the HW and SW primitives for the programmer?
  - Does it translate into performance?

Opportunities: Scientific Computing

- Parallel-processor "religion": nearly unlimited demand (Grand Challenge applications):

    Application               Perf (GFLOPS)   Memory (GB)
    48-hour weather           0.1             0.1
    72-hour weather           3               1
    Pharmaceutical design     100             10
    Global change, genome     1000            1000

- The dream of computer architects since 1960: replicate processors to add performance, versus designing a faster processor.
- Led to innovative organizations tied to particular programming models, since "uniprocessors can't keep going" (e.g., uniprocessors must stop getting faster because of the speed-of-light limit): argued in 1972, ..., 1989.
- Borders on religious fervor: you must believe! The fervor was damped somewhat when companies went out of business in the 1990s (Thinking Machines, Kendall Square, ...).
- There have been successes in some real industries; the argument now is the pull of the opportunity of scalable performance, not the push of a uniprocessor performance plateau:
  - Petroleum: reservoir modeling
  - Automotive: crash simulation, drag analysis, engines
  - Aeronautics: airflow analysis, engines, structural mechanics
  - Pharmaceuticals: molecular modeling
  - Entertainment: full-length movies (Toy Story)

Opportunities: Commercial Computing

- Transaction processing (TPC-C benchmark): performance improves over time (load balancing and other tuning) and scales from small-scale to large-scale parallel processors.
- Others: file servers, electronic CAD simulation (multiple processes), WWW search engines.

Example: Scientific Computing

- Molecular dynamics on the Intel Paragon with 128 processors (1994).
- 128-processor Intel Paragon: 406 MFLOPS; C90 vector machine: 145 MFLOPS, i.e., about 45 of the Intel processors.

What Level of Parallelism?

- Bit-level parallelism: roughly 1970 to 1985 (4-bit, 8-bit, 16-bit, then 32-bit microprocessors).
- Instruction-level parallelism (ILP): roughly 1985 through today (pipelining, superscalar, VLIW, out-of-order execution); there are limits to the benefits of ILP.
- Process-level or thread-level parallelism: mainstream for general-purpose computing? Servers are parallel; high-end desktops are dual-processor PCs.

Flynn's Taxonomy of Parallel Architectures

- Single Instruction, Single Data (SISD): the uniprocessor.
- Single Instruction, Multiple Data (SIMD): Illiac IV, CM-2.
- Multiple Instruction, Single Data (MISD).
- Multiple Instruction, Multiple Data (MIMD).
- MIMD advantages:
  - MIMDs are more flexible: they can function as single-user and/or multiprogrammed machines.
  - MIMDs are cost-effective: they use off-the-shelf microprocessors.

Data Parallel Model (SIMD)

- Operations can be performed in parallel on each element of a large, regular data structure, such as an array.
- One control processor broadcasts instructions to many processing elements (PEs); a condition flag per PE lets a PE skip an operation.
- Data are distributed across the PE memories; data-parallel programming languages lay out data onto the processors.
- SIMD led to data-parallel programming languages.
- Advancing VLSI led to single-chip FPUs and whole fast processors, making SIMD less attractive.
- The SIMD programming model led to the Single Program, Multiple Data (SPMD) model: all processors execute an identical program.
- Data-parallel programming languages are still useful: do communication all at once, in "bulk synchronous" phases in which all processors communicate after a global barrier (see the sketch right after this list).
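As a concrete illustration of the SPMD, bulk-synchronous style described above (this sketch is not part of the original notes), the following C program assumes MPI; the array size and the sum reduction are purely illustrative. Every process runs the same program on its own slice of a distributed array, then all processes communicate at once in a single collective after the compute phase.

    /* SPMD sketch (illustrative, assumes MPI): every process executes this same
     * program on its own slice of the data; a bulk-synchronous communication
     * phase (a collective reduction) follows the local compute phase.
     * Build with mpicc, run with mpirun -np <N>. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_PER_PROC 1000   /* elements owned by each process (illustrative size) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double a[N_PER_PROC], local_sum = 0.0, global_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Compute phase: identical code on every processor, different data
         * (the rank selects which slice of the global array this process owns). */
        for (int i = 0; i < N_PER_PROC; i++) {
            a[i] = (double)(rank * N_PER_PROC + i);
            local_sum += a[i];
        }

        /* Bulk-synchronous communication phase: all processes exchange at once. */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d processes = %g\n", nprocs, global_sum);

        MPI_Finalize();
        return 0;
    }

The pattern (local work, then one global exchange) is roughly what data-parallel languages map onto when run on MIMD machines.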
Small-Scale vs. Large-Scale MIMD

- Small-scale MIMD: memory is centralized, with uniform memory access time (UMA) and a bus interconnect. Examples: SPARCCenter, Challenge.
- Large-scale MIMD: memory is distributed, with nonuniform access time (NUMA) and a scalable interconnect (distributed memory). Examples: T3D, Exemplar, Paragon, CM-5.

Communication Models

- Shared address (shared memory): processors communicate through a shared address space; easy on small-scale machines. Advantages: the model of choice for uniprocessors and small-scale multiprocessors; ease of programming.
- Message passing: processors have private memories (separate address spaces) and communicate via messages. Advantages: less hardware, so easier to design; focuses attention on costly non-local operations.

Shared Address Model (on NUMA machines: Distributed Shared Memory, DSM)

- Each processor can name every physical location in the machine; each process can name all the data it shares with other processes.
- Data transfer is via load and store; the data size is a byte, a word, or cache blocks.
- Uses virtual memory to map a virtual address to a local or remote physical address.
- The memory hierarchy model still applies: communication now moves data into the local processor's cache, just as a load moves data from memory into the cache.

Message Passing Model

- Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations; essentially NUMA, but integrated at the I/O devices rather than at the memory system.
- A send specifies a local buffer and the receiving process on the remote computer; a receive specifies the sending process on the remote computer and a local buffer in which to place the data.
- Usually the send includes a process tag, and the receive has a matching rule on the tag: match one, or match any.
- Synchronization: when the send completes, when the buffer is free, when the request is accepted, or the receive waits for the send.
- Send + receive => a memory-to-memory copy in which each side supplies a local address AND pairwise synchronization is performed; send/receive synchronization goes through the OS, even on one processor. (A minimal send/receive sketch appears at the end of these notes.)

Message Passing Model: Example

- IBM SP-2: RS/6000 workstations in racks.
- The network interface card contains an Intel 960; an 8x8 crossbar switch is the communication building block; 40 MByte/sec per link.

Issues in Multiprocessors

- Limited parallelism.
- Synchronization.
- Latency and bandwidth.

Limited Parallelism

- Not all applications show sufficient parallelism (or do they?).
- There is obvious parallelism and not-so-obvious parallelism.
- Solution: software (compilers).

Synchronization

- To cooperate, processes must coordinate.
- Message passing: coordination is implicit in the transmission or arrival of data.
- Shared address: additional operations are needed to coordinate explicitly, e.g., write a flag, awaken a thread, interrupt a processor.
- Discussed later in Section 8.5 (possibly).

Latency and Bandwidth

- Bandwidth: communication needs high bandwidth; it cannot scale perfectly, but should stay close; match the limits of the network, memory, and processor. The overhead to communicate is a problem in many machines.
- Latency: affects performance, since the processor may have to wait; affects ease of programming, since it takes more thought to overlap communication with computation.
- Latency hiding: how can a mechanism help hide latency? Examples: overlap message send with computation (see the sketch at the end of these notes), prefetch data, switch to other tasks. Discussed in Sections 8.3, 8.4, and 8.6.

Solution to the Bandwidth and Latency Problem: Caches

- Caching shared data introduces the cache-coherence problem.
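Tying together the message-passing and latency-hiding discussion above, here is a minimal sketch (not part of the original notes) that assumes MPI and at least two processes: rank 0 posts a non-blocking send naming the receiving process, a local buffer, and a tag; rank 1 posts a matching non-blocking receive; both overlap independent computation with the transfer and then synchronize pairwise with a wait. The tag value, message length, and dummy computation are illustrative.

    /* Message-passing sketch (illustrative, assumes MPI, run with at least 2 ranks):
     * a tagged pairwise send/receive posted as non-blocking operations so that
     * computation can overlap the communication (latency hiding). */
    #include <mpi.h>
    #include <stdio.h>

    #define N        1000   /* illustrative message length */
    #define TAG_DATA 7      /* illustrative tag; the receiver could also use MPI_ANY_TAG */

    int main(int argc, char **argv)
    {
        int rank;
        double out[N], in[N], acc = 0.0;
        MPI_Request req = MPI_REQUEST_NULL;   /* stays null on ranks other than 0 and 1 */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++)
            out[i] = rank + i * 0.001;

        if (rank == 0) {
            /* Send names the receiving process (1), a local buffer, and a tag. */
            MPI_Isend(out, N, MPI_DOUBLE, 1, TAG_DATA, MPI_COMM_WORLD, &req);
        } else if (rank == 1) {
            /* Receive names the sending process (0), a local buffer, and the tag to match. */
            MPI_Irecv(in, N, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD, &req);
        }

        /* Independent computation proceeds while the message is in flight. */
        for (int i = 0; i < N; i++)
            acc += (rank + 1) * 0.001 * i;

        /* Pairwise synchronization: completes the send on rank 0 and the receive on
         * rank 1; a no-op for the null request on any other rank. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 1)
            printf("received %d doubles, in[0] = %g, local acc = %g\n", N, in[0], acc);

        MPI_Finalize();
        return 0;
    }

A blocking MPI_Send/MPI_Recv pair would express the same memory-to-memory copy but would forgo the overlap, which is exactly the latency the notes suggest hiding.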