WM CSCI 424 - Network Connected Multiprocessors


Contents:
- Review: Bus Connected SMPs (UMAs)
- Review: Multiprocessor Basics
- Network Connected Multiprocessors
- Summing 100,000 Numbers on 100 Processors
- An Example with 10 Processors
- Communication in Network Connected Multi's
- Cache Coherency in NUMAs
- IN Performance Metrics
- Bus IN
- Ring IN
- Fully Connected IN
- Crossbar (Xbar) Connected IN
- Hypercube (Binary N-cube) Connected IN
- 2D and 3D Mesh/Torus Connected IN
- Fat Tree
- IN Comparison
- Network Connected Multiprocessors (example machines)
- IBM BlueGene
- A BlueGene/L Chip
- Networks of Workstations (NOWs) Clusters
- Commercial (NOW) Clusters
- Summary

Review: Bus Connected SMPs (UMAs)

- Caches are used to reduce latency and to lower bus traffic.
- Hardware must be provided for cache coherence and for process synchronization.
- Bus traffic and bandwidth limit scalability to roughly 36 processors.

[Figure: several processors, each with a private cache, sharing a single bus to memory and I/O]

Review: Multiprocessor Basics

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

                                          # of processors
    Communication   Message passing       8 to 2048
    model           Shared addr., NUMA    8 to 256
                    Shared addr., UMA     2 to 64
    Physical        Network               8 to 256
    connection      Bus                   2 to 36

Network Connected Multiprocessors

- Either a single address space (NUMA and ccNUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with explicit message-passing communication via sends and receives.
- An interconnection network (IN) supports the interprocessor communication.

[Figure: processors, each with a private cache and a private memory, joined by an interconnection network]

Summing 100,000 Numbers on 100 Processors

- Start by distributing 1000 elements of vector A to each of the local memories; each processor sums its subset in parallel:

      sum = 0;
      for (i = 0; i < 1000; i = i + 1)
          sum = sum + Al[i];    /* sum local array subset */

- The processors then coordinate in adding together the partial sums (Pn is this processor's number, send(x,y) sends value y to processor x, and receive() blocks until a value arrives):

      half = 100;
      limit = 100;
      repeat
          half = (half + 1) / 2;    /* dividing line */
          if (Pn >= half && Pn < limit) send(Pn - half, sum);
          if (Pn < limit / 2) sum = sum + receive();
          limit = half;
      until (half == 1);            /* final sum is in P0's sum */

An Example with 10 Processors

[Figure: reduction tree over P0..P9, with half taking the values 10, 5, 3, 2, 1]

- In the first step P5..P9 send their sums to P0..P4, which add them in.
- When the active set is odd (e.g. limit = 5, half = 3), the middle processor (P2) neither sends nor receives that step; this is why half is rounded up and only processors below limit/2 receive.
- Each step roughly halves the active set until P0 holds the final sum.
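The reduction pseudocode assumes blocking send and receive primitives. As a concrete (illustrative, not from the slides) rendering, here is a minimal C/MPI sketch of the same half/limit loop; the stand-in data values and the file name are assumptions:

      /* reduce_sum.c - tree reduction of partial sums, mirroring the slide's
       * half/limit pseudocode.
       * Build:  mpicc reduce_sum.c -o reduce_sum
       * Run:    mpirun -np 100 ./reduce_sum
       */
      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int Pn, nprocs;
          MPI_Comm_rank(MPI_COMM_WORLD, &Pn);     /* this processor's number */
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          /* Sum the local subset; all-ones stand-in data for illustration. */
          double Al[1000], sum = 0.0;
          for (int i = 0; i < 1000; i = i + 1) Al[i] = 1.0;
          for (int i = 0; i < 1000; i = i + 1) sum = sum + Al[i];

          int half = nprocs, limit = nprocs;
          do {
              half = (half + 1) / 2;              /* dividing line */
              if (Pn >= half && Pn < limit)       /* upper half sends... */
                  MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
              if (Pn < limit / 2) {               /* ...lower half receives */
                  double other;
                  MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  sum = sum + other;
              }
              limit = half;
          } while (half != 1);                    /* final sum in P0's sum */

          if (Pn == 0) printf("total = %.0f\n", sum);
          MPI_Finalize();
          return 0;
      }

With 100 processes each contributing 1000 ones, P0 prints 100000; senders and receivers are disjoint in every step, so the blocking calls pair up without deadlock.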
Communication in Network Connected Multi's

- Implicit communication via loads and stores:
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to fetch remote data by address when it is actually demanded than to send it ahead in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives:
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication

Cache Coherency in NUMAs

- For performance reasons we want to allow shared data to be stored in caches.
- Once again there are multiple copies of the same data, with the same address, in different processors' caches.
- Bus snooping won't work, since there is no single bus on which all memory references are broadcast.
- Directory-based protocols:
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether the block is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data (a sketch of a directory entry appears at the end of these notes)

IN Performance Metrics

- Network cost:
  - number of switches
  - number of (bidirectional) links on a switch needed to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB), the best case: bandwidth of each link * number of links
- Bisection bandwidth (BB), the worst case: divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues:
  - latency on an unloaded network to send and receive a message
  - throughput: the maximum number of messages transmitted per unit time
  - worst-case number of routing hops, congestion control, and delay

Bus IN

- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
- NB = link (bus) bandwidth * 1
- BB = link (bus) bandwidth * 1

Ring IN

- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * 2
- If a link is as fast as the bus, the ring is only twice as fast as a bus in the worst case, but N times faster in the best case.

Fully Connected IN

- N processors, N switches, N-1 links/switch, N*(N-1)/2 links
- N simultaneous transfers
- NB = link bandwidth * N*(N-1)/2
- BB = link bandwidth * (N/2)^2

Crossbar (Xbar) Connected IN

- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * N/2

Hypercube (Binary N-cube) Connected IN

- N processors, N switches, log2(N) links/switch, (N * log2(N))/2 links
- N simultaneous transfers
- NB = link bandwidth * (N * log2(N))/2
- BB = link bandwidth * N/2

[Figure: a 2-cube and a 3-cube]

2D and 3D Mesh/Torus Connected IN

- N processors, N switches; 2, 3, or 4 links/switch in a 2D mesh, 4 in a 2D torus, 6 in a 3D torus
- 4N/2 links (2D torus) or 6N/2 links (3D torus)
- N simultaneous transfers

Fat Tree

- Trees are good structures; people in CS use them all the time. Suppose we wanted to make a tree network.

[Figure: a binary tree network with leaves A, B, C, D]

- Any time A wants to send to C, it ties up the upper links, so B can't send to D. The bisection bandwidth of a tree is terrible: 1 link, at all times.
- The solution is to "thicken" the upper links: adding more links toward the root increases the bisection bandwidth.
- Rather than designing switches with many ports for the upper levels, use pairs of switches.

IN Comparison

- For a 64-processor system, filling in the values implied by the formulas above (NB and BB in units of link bandwidth; "links per switch" is network links plus the one processor link, and "total # of links" includes the 64 processor links):

                             Bus    Ring    2D Torus    6-cube    Fully connected
      Network bandwidth        1      64         128       192               2016
      Bisection bandwidth      1       2          16        32               1024
      Total # of switches      1      64          64        64                 64
      Links per switch         -     2+1         4+1       6+1               63+1
      Total # of links         1     128         192       256               2080

Network Connected Multiprocessors (example machines)

      Proc    Proc Speed    # Proc    IN Topology    BW/link (MB/sec)
      SGI ...

[The preview breaks off here; the slides on IBM BlueGene, A BlueGene/L Chip, Networks of Workstations (NOWs) Clusters, Commercial (NOW) Clusters, and the Summary are not included.]
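The comparison table can be reproduced mechanically from the per-topology formulas. Below is a small self-contained C sketch; the file and variable names are my own, and N is assumed to be an even power of two so the 2D torus is square:

      /* in_metrics.c - NB and BB, in units of link bandwidth, for the
       * topologies above, evaluated for N = 64 processors. */
      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          int N = 64;
          int logN = (int)round(log2((double)N));   /* 6 for N = 64 */
          int side = (int)round(sqrt((double)N));   /* 8x8 2D torus  */

          printf("topology         NB      BB\n");
          printf("bus          %6d  %6d\n", 1, 1);
          printf("ring         %6d  %6d\n", N, 2);
          printf("2D torus     %6d  %6d\n", 4 * N / 2, 2 * side);
          printf("%d-cube       %6d  %6d\n", logN, N * logN / 2, N / 2);
          printf("fully conn.  %6d  %6d\n", N * (N - 1) / 2, (N / 2) * (N / 2));
          return 0;
      }

Compiled with `cc in_metrics.c -lm`, this prints the NB and BB rows of the table; note the 2D-torus bisection counts the sqrt(N) wraparound links as well as the sqrt(N) direct links crossing the cut.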

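Returning to the directory-based coherence slide: here is a minimal sketch of what one distributed directory entry might hold, using a full bit vector of sharers. The three-state encoding and all names are illustrative assumptions, not details from the slides.

      /* directory.c - one directory entry per memory block, kept at the
       * block's home node; the directory controller consults it before
       * sending explicit commands over the IN. */
      #include <stdio.h>
      #include <stdint.h>

      #define MAX_PROCS 64

      typedef enum {
          UNCACHED,    /* no cache holds the block           */
          SHARED,      /* one or more caches hold it, clean  */
          EXCLUSIVE    /* exactly one cache holds it, dirty  */
      } BlockState;

      typedef struct {
          BlockState state;     /* coherence state of the block        */
          uint64_t   sharers;   /* bit i set => processor i has a copy */
      } DirEntry;

      /* Stand-in for an explicit command sent over the IN. */
      static void send_invalidate(int proc)
      {
          printf("invalidate -> P%d\n", proc);
      }

      /* On a write request: invalidate every other copy, then record the
       * writer as the sole (dirty) owner. */
      static void handle_write(DirEntry *e, int writer)
      {
          for (int p = 0; p < MAX_PROCS; p = p + 1)
              if (((e->sharers >> p) & 1) && p != writer)
                  send_invalidate(p);
          e->sharers = 1ULL << writer;
          e->state   = EXCLUSIVE;
      }

      int main(void)
      {
          DirEntry e = { SHARED, (1ULL << 3) | (1ULL << 7) };  /* P3, P7 share */
          handle_write(&e, 3);   /* P3 writes: only P7 gets an invalidate */
          return 0;
      }

Because each block's entry lives at a single known home node, requests for different blocks contend at different controllers, which is the contention-reduction point made on the slide.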
