Scalable Distributed Memory Machines

Goal: Parallel machines that can be scaled to hundreds or thousands of processors.

Design choices:
- Custom-designed or commodity nodes
- Network scalability
- Capability of the node-to-network interface (critical)
- Supporting programming models

What does hardware scalability mean?
- Avoids inherent design limits on resources
- Bandwidth increases with machine size P
- Latency should not increase with machine size P
- Cost should increase slowly with P

EECC756 - Shaaban, lec 13, Spring 2002, 5-2-2002


MPPs Scalability Issues

Problems:
- Memory-access latency
- Interprocess communication complexity or synchronization overhead
- Multi-cache inconsistency
- Message-passing and message-processing overheads

Possible solutions:
- Fast, dedicated, proprietary and scalable networks and protocols
- Low-latency, fast synchronization techniques, possibly hardware-assisted
- Hardware-assisted message processing in communication assists (node-to-network interfaces)
- Weaker memory consistency models
- Scalable directory-based cache coherence protocols
- Shared virtual memory
- Improved software portability; standard parallel and distributed operating system support
- Software latency-hiding techniques


One Extreme: Limited Scaling of a Bus

Characteristic               Bus
Physical length              1 ft
Number of connections        fixed
Maximum bandwidth            fixed
Interface to comm. medium    memory interface
Global order                 arbitration
Protection                   virtual -> physical
Trust                        total
OS                           single
Comm. abstraction            HW

Poor scalability.

Bus: Each level of the system design is grounded in the scaling limits at the layers below and in assumptions of close coupling between components.


Another Extreme: Scaling of Workstations in a LAN

Characteristic               Bus                    LAN
Physical length              1 ft                   km
Number of connections        fixed                  many
Maximum bandwidth            fixed
Interface to comm. medium    memory interface       peripheral
Global order                 arbitration
Protection                   virtual -> physical    OS
Trust                        total                  none
OS                           single                 independent
Comm. abstraction            HW                     SW

No clear limit to physical scaling, no global order, consensus difficult to achieve.


Bandwidth Scalability

[Figure: typical switches (bus, crossbar, multiplexers) connecting processor/memory nodes (P, M) through switches (S)]

Depends largely on network characteristics:
- Channel bandwidth
- Static: topology (node degree, bisection width, etc.)
- Multistage: switch size and connection pattern properties
- Node-to-network interface capabilities


Dancehall MP Organization

[Figure: memory modules (M) and processors (P) on opposite sides of a scalable network of switches]

Annotations on the figure:
- Network bandwidth
- Bandwidth demand: independent processes, communicating processes
- Latency

Extremely high demands on the network in terms of bandwidth and latency, even for independent processes.


Generic Distributed Memory Organization

[Figure: nodes, each with a processor (P), memory (M) and communication assist (CA), attached to the switches of a scalable network]

Annotations on the figure:
- OS supported; network protocols
- Communication assist: extent of functionality
- Global virtual shared address space
- Message transaction, DMA
- Scalable network: multi-stage interconnection network (MIN), custom-designed
- Network bandwidth
- Bandwidth demand: independent processes, communicating processes
- Latency: O(log2 P) increase
- Cost scalability of system
- Node: O(10) bus-based SMP; custom-designed CPU
- Node/system integration level: how far? Cray-on-a-Chip, SMP-on-a-Chip


Key System Scaling Property

- Large number of independent communication paths between nodes.
- Allow a large number of concurrent transactions using different channels.
- Transactions are initiated independently.
- No global arbitration.
- Effect of a transaction is visible only to the nodes involved.
  - Effects are propagated through additional transactions.


Network Latency Scaling

  T(n) = Overhead + Channel Time + Routing Delay

- Scaling of overhead
- Channel Time(n) = n / B, where B is the BW at the bottleneck
- RoutingDelay(h, n)


Network Latency Scaling Example

O(log2 n)-stage MIN using switches:
- Max distance: log2 n
- Number of switches: on the order of n log n
- Overhead = 1 us, BW = 64 MB/s, 200 ns per hop

Using pipelined or cut-through routing:
  T64(128)   = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop  = 4.2 us
  T1024(128) = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop = 5.0 us
Only a 20% increase in latency for a 16x increase in machine size.

Store and forward:
  T64sf(128)   = 1.0 us + 6 hops * (2.0 + 0.2) us/hop  = 14.2 us
  T1024sf(128) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop = 23 us


Cost Scaling

  cost(p, m) = fixed cost + incremental cost(p, m)

- Bus-based SMP
- Ratio of processors : memory : network : I/O
- Parallel efficiency(p) = Speedup(p) / p
- Similar to speedup, one can define: Costup(p) = Cost(p) / Cost(1)
- Cost-effective: speedup(p) > costup(p)


Cost Effective

[Figure: speedup and costup plotted against number of processors, both axes from 0 to 2000]

  Speedup = P / (1 + log P)
  Costup  = 1 + 0.1 P

2048 processors: 475-fold speedup at 206x cost.


Parallel Machine Network Examples

[Figure/table not preserved in this text]


Physical Scaling

Chip-level integration:
- Integrate network interface, message router, I/O links: nCUBE/2, Alpha 21364, IBM Power 4
- IRAM-style Cray-on-a-Chip: V-IRAM
- Memory/bus controller/chip set: Alpha 21364
- SMP on a chip, Chip Multiprocessor (CMP): IBM Power 4

Board-level:
- Replicating using standard microprocessor cores.
- CM-5 replicated the core of a Sun SPARCstation 1 workstation.
- Cray T3D and T3E replicated the core of a DEC Alpha workstation.

System level:
- IBM SP-2 uses 8-16 almost complete RS6000 workstations placed in racks.


Chip-level integration example: nCUBE/2 Machine Organization

[Figure: basic module with MMU, I-fetch and decode, operand, router, DRAM interface, DMA channels, and execution unit (64-bit integer, IEEE floating point); hypercube network configuration]

- 64 nodes socketed on a board
- Single-chip node
- 500,000 transistors (considered large at the time)
- 13 links: up to 2^13 = 8192 nodes possible
- Entire machine synchronous at 40 MHz


Chip-level integration example: Vector Intelligent RAM 2 (V-IRAM-2)

Projected 2003: 0.1 um, 2 GHz, 16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB

[Figure: block diagram with a 2-way superscalar processor, 8K I-cache and 8K D-cache, vector instruction queue, vector registers, load/store unit, datapaths of 8 x 64 or 16 x 32 or 32 x 16, and a memory crossbar switch connecting memory banks (M) and I/O]


Chip-level integration example: Alpha 21364

- Alpha 21264 core with enhancements
- Integrated
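
The arithmetic on the Network Latency Scaling Example slide can be checked with a short script. This is only a sketch of the slide's model T(n) = Overhead + Channel Time + Routing Delay; the function and parameter names are made up for illustration, the hop count is assumed to equal log2 of the node count (the slide's "max distance"), and 1 MB is taken as 10^6 bytes so that 128 bytes at 64 MB/s gives the slide's 2.0 us.

```python
import math


def channel_time_us(msg_bytes, bw_mb_per_s):
    # bytes / (MB/s) gives microseconds when 1 MB = 10**6 bytes (assumption).
    return msg_bytes / bw_mb_per_s


def cut_through_latency_us(n_nodes, msg_bytes, overhead_us=1.0,
                           bw_mb_per_s=64.0, hop_us=0.2):
    # Pipelined / cut-through: the message is serialized once,
    # and each hop adds only the per-hop routing delay.
    hops = math.log2(n_nodes)  # assumed: max distance in a log2(n)-stage MIN
    return overhead_us + channel_time_us(msg_bytes, bw_mb_per_s) + hops * hop_us


def store_and_forward_latency_us(n_nodes, msg_bytes, overhead_us=1.0,
                                 bw_mb_per_s=64.0, hop_us=0.2):
    # Store and forward: the whole message is retransmitted at every hop,
    # so the channel time is paid once per hop.
    hops = math.log2(n_nodes)
    return overhead_us + hops * (channel_time_us(msg_bytes, bw_mb_per_s) + hop_us)


if __name__ == "__main__":
    for n in (64, 1024):
        ct = cut_through_latency_us(n, 128)
        sf = store_and_forward_latency_us(n, 128)
        print(f"n={n}: cut-through {ct:.1f} us, store-and-forward {sf:.1f} us")
```

With the slide's parameters (overhead 1 us, 64 MB/s, 200 ns per hop, 128-byte message) this gives 4.2 us and 5.0 us for cut-through and 14.2 us and 23.0 us for store and forward, matching the figures on the slide. The gap widens with machine size because store and forward multiplies the channel time by the hop count.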
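
The cost-effectiveness test from the Cost Scaling and Cost Effective slides, speedup(p) > costup(p), can be evaluated the same way. The sketch below uses the two curves given on the slide, Speedup = P / (1 + log P) and Costup = 1 + 0.1 P. The slide does not state the base of the logarithm; base 10 is assumed here because it reproduces the quoted 475-fold speedup at 2048 processors. The function names are hypothetical.

```python
import math


def speedup(p):
    # Slide's example curve; log base 10 is an assumption (see lead-in).
    return p / (1 + math.log10(p))


def costup(p):
    # Slide's example curve: Cost(p) / Cost(1) = 1 + 0.1 * p.
    return 1 + 0.1 * p


def is_cost_effective(p):
    # Cost-effective when speedup grows faster than relative cost.
    return speedup(p) > costup(p)


if __name__ == "__main__":
    p = 2048
    print(f"p={p}: speedup {speedup(p):.0f}, costup {costup(p):.0f}, "
          f"cost-effective: {is_cost_effective(p)}")
```

Under these assumptions, 2048 processors give roughly a 475-fold speedup at about 206 times the cost of one processor. Parallel efficiency is then only about 23%, yet the machine is still cost-effective by the slide's definition because speedup exceeds costup.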