WM CSCI 424 - Network Connected Multiprocessors


Contents:
- Review: Bus Connected SMPs (UMAs)
- Review: Multiprocessor Basics
- Network Connected Multiprocessors
- Summing 100,000 Numbers on 100 Processors
- An Example with 10 Processors
- Communication in Network Connected Multi's
- Cache Coherency in NUMAs
- IN Performance Metrics
- Bus IN
- Ring IN
- Fully Connected IN
- Crossbar (Xbar) Connected IN
- Hypercube (Binary N-cube) Connected IN
- 2D and 3D Mesh/Torus Connected IN
- Fat Tree
- IN Comparison
- Network Connected Multiprocessors (example machines)
- IBM BlueGene
- A BlueGene/L Chip
- Networks of Workstations (NOWs) Clusters
- Commercial (NOW) Clusters
- Summary

Review: Bus Connected SMPs (UMAs)

- Caches are used to reduce latency and to lower bus traffic.
- Hardware must be provided for cache coherence and for process synchronization.
- Bus traffic and bandwidth limit scalability to roughly 36 processors.

[Figure: several processors, each with a private cache, sharing a single bus to memory and I/O]

Review: Multiprocessor Basics

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

                                          # of processors
    Communication   Message passing       8 to 2048
    model           Shared addr., NUMA    8 to 256
                    Shared addr., UMA     2 to 64
    Physical        Network               8 to 256
    connection      Bus                   2 to 36

Network Connected Multiprocessors

- Either a single address space (NUMA and ccNUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with explicit message-passing communication via sends and receives.
- An interconnection network (IN) supports the interprocessor communication.

[Figure: processors, each with a private cache and a private memory, joined by an interconnection network]

Summing 100,000 Numbers on 100 Processors

- Start by distributing 1000 elements of vector A to each of the local memories; each processor sums its subset in parallel:

      sum = 0;
      for (i = 0; i < 1000; i = i + 1)
          sum = sum + Al[i];    /* sum local array subset */

- The processors then coordinate in adding together the partial sums (Pn is this processor's number, send(x,y) sends value y to processor x, and receive() blocks until a value arrives):

      half = 100;
      limit = 100;
      repeat
          half = (half + 1) / 2;    /* dividing line */
          if (Pn >= half && Pn < limit) send(Pn - half, sum);
          if (Pn < limit / 2) sum = sum + receive();
          limit = half;
      until (half == 1);            /* final sum is in P0's sum */

An Example with 10 Processors

[Figure: reduction tree over P0..P9, with half taking the values 10, 5, 3, 2, 1]

- In the first step P5..P9 send their sums to P0..P4, which add them in.
- When the active set is odd (e.g. limit = 5, half = 3), the middle processor (P2) neither sends nor receives that step; this is why half is rounded up and only processors below limit/2 receive.
- Each step roughly halves the active set until P0 holds the final sum.
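The reduction pseudocode assumes blocking send and receive primitives. As a concrete (illustrative, not from the slides) rendering, here is a minimal C/MPI sketch of the same half/limit loop; the stand-in data values and the file name are assumptions:

      /* reduce_sum.c - tree reduction of partial sums, mirroring the slide's
       * half/limit pseudocode.
       * Build:  mpicc reduce_sum.c -o reduce_sum
       * Run:    mpirun -np 100 ./reduce_sum
       */
      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int Pn, nprocs;
          MPI_Comm_rank(MPI_COMM_WORLD, &Pn);     /* this processor's number */
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          /* Sum the local subset; all-ones stand-in data for illustration. */
          double Al[1000], sum = 0.0;
          for (int i = 0; i < 1000; i = i + 1) Al[i] = 1.0;
          for (int i = 0; i < 1000; i = i + 1) sum = sum + Al[i];

          int half = nprocs, limit = nprocs;
          do {
              half = (half + 1) / 2;              /* dividing line */
              if (Pn >= half && Pn < limit)       /* upper half sends... */
                  MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
              if (Pn < limit / 2) {               /* ...lower half receives */
                  double other;
                  MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  sum = sum + other;
              }
              limit = half;
          } while (half != 1);                    /* final sum in P0's sum */

          if (Pn == 0) printf("total = %.0f\n", sum);
          MPI_Finalize();
          return 0;
      }

With 100 processes each contributing 1000 ones, P0 prints 100000; senders and receivers are disjoint in every step, so the blocking calls pair up without deadlock.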
Communication in Network Connected Multi's

- Implicit communication via loads and stores:
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to fetch remote data by address when it is actually demanded than to send it ahead in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives:
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication

Cache Coherency in NUMAs

- For performance reasons we want to allow shared data to be stored in caches.
- Once again there are multiple copies of the same data, with the same address, in different processors' caches.
- Bus snooping won't work, since there is no single bus on which all memory references are broadcast.
- Directory-based protocols:
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether the block is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data (a sketch of a directory entry appears at the end of these notes)

IN Performance Metrics

- Network cost:
  - number of switches
  - number of (bidirectional) links on a switch needed to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB), the best case: bandwidth of each link * number of links
- Bisection bandwidth (BB), the worst case: divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues:
  - latency on an unloaded network to send and receive a message
  - throughput: the maximum number of messages transmitted per unit time
  - worst-case number of routing hops, congestion control, and delay

Bus IN

- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
- NB = link (bus) bandwidth * 1
- BB = link (bus) bandwidth * 1

Ring IN

- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * 2
- If a link is as fast as the bus, the ring is only twice as fast as a bus in the worst case, but N times faster in the best case.

Fully Connected IN

- N processors, N switches, N-1 links/switch, N*(N-1)/2 links
- N simultaneous transfers
- NB = link bandwidth * N*(N-1)/2
- BB = link bandwidth * (N/2)^2

Crossbar (Xbar) Connected IN

- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * N/2

Hypercube (Binary N-cube) Connected IN

- N processors, N switches, log2(N) links/switch, (N * log2(N))/2 links
- N simultaneous transfers
- NB = link bandwidth * (N * log2(N))/2
- BB = link bandwidth * N/2

[Figure: a 2-cube and a 3-cube]

2D and 3D Mesh/Torus Connected IN

- N processors, N switches; 2, 3, or 4 links/switch in a 2D mesh, 4 in a 2D torus, 6 in a 3D torus
- 4N/2 links (2D torus) or 6N/2 links (3D torus)
- N simultaneous transfers

Fat Tree

- Trees are good structures; people in CS use them all the time. Suppose we wanted to make a tree network.

[Figure: a binary tree network with leaves A, B, C, D]

- Any time A wants to send to C, it ties up the upper links, so B can't send to D. The bisection bandwidth of a tree is terrible: 1 link, at all times.
- The solution is to "thicken" the upper links: adding more links toward the root increases the bisection bandwidth.
- Rather than designing switches with many ports for the upper levels, use pairs of switches.

IN Comparison

- For a 64-processor system, filling in the values implied by the formulas above (NB and BB in units of link bandwidth; "links per switch" is network links plus the one processor link, and "total # of links" includes the 64 processor links):

                             Bus    Ring    2D Torus    6-cube    Fully connected
      Network bandwidth        1      64         128       192               2016
      Bisection bandwidth      1       2          16        32               1024
      Total # of switches      1      64          64        64                 64
      Links per switch         -     2+1         4+1       6+1               63+1
      Total # of links         1     128         192       256               2080

Network Connected Multiprocessors (example machines)

      Proc    Proc Speed    # Proc    IN Topology    BW/link (MB/sec)
      SGI ...

[The preview breaks off here; the slides on IBM BlueGene, A BlueGene/L Chip, Networks of Workstations (NOWs) Clusters, Commercial (NOW) Clusters, and the Summary are not included.]
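The comparison table can be reproduced mechanically from the per-topology formulas. Below is a small self-contained C sketch; the file and variable names are my own, and N is assumed to be an even power of two so the 2D torus is square:

      /* in_metrics.c - NB and BB, in units of link bandwidth, for the
       * topologies above, evaluated for N = 64 processors. */
      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          int N = 64;
          int logN = (int)round(log2((double)N));   /* 6 for N = 64 */
          int side = (int)round(sqrt((double)N));   /* 8x8 2D torus  */

          printf("topology         NB      BB\n");
          printf("bus          %6d  %6d\n", 1, 1);
          printf("ring         %6d  %6d\n", N, 2);
          printf("2D torus     %6d  %6d\n", 4 * N / 2, 2 * side);
          printf("%d-cube       %6d  %6d\n", logN, N * logN / 2, N / 2);
          printf("fully conn.  %6d  %6d\n", N * (N - 1) / 2, (N / 2) * (N / 2));
          return 0;
      }

Compiled with `cc in_metrics.c -lm`, this prints the NB and BB rows of the table; note the 2D-torus bisection counts the sqrt(N) wraparound links as well as the sqrt(N) direct links crossing the cut.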

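Returning to the directory-based coherence slide: here is a minimal sketch of what one distributed directory entry might hold, using a full bit vector of sharers. The three-state encoding and all names are illustrative assumptions, not details from the slides.

      /* directory.c - one directory entry per memory block, kept at the
       * block's home node; the directory controller consults it before
       * sending explicit commands over the IN. */
      #include <stdio.h>
      #include <stdint.h>

      #define MAX_PROCS 64

      typedef enum {
          UNCACHED,    /* no cache holds the block           */
          SHARED,      /* one or more caches hold it, clean  */
          EXCLUSIVE    /* exactly one cache holds it, dirty  */
      } BlockState;

      typedef struct {
          BlockState state;     /* coherence state of the block        */
          uint64_t   sharers;   /* bit i set => processor i has a copy */
      } DirEntry;

      /* Stand-in for an explicit command sent over the IN. */
      static void send_invalidate(int proc)
      {
          printf("invalidate -> P%d\n", proc);
      }

      /* On a write request: invalidate every other copy, then record the
       * writer as the sole (dirty) owner. */
      static void handle_write(DirEntry *e, int writer)
      {
          for (int p = 0; p < MAX_PROCS; p = p + 1)
              if (((e->sharers >> p) & 1) && p != writer)
                  send_invalidate(p);
          e->sharers = 1ULL << writer;
          e->state   = EXCLUSIVE;
      }

      int main(void)
      {
          DirEntry e = { SHARED, (1ULL << 3) | (1ULL << 7) };  /* P3, P7 share */
          handle_write(&e, 3);   /* P3 writes: only P7 gets an invalidate */
          return 0;
      }

Because each block's entry lives at a single known home node, requests for different blocks contend at different controllers, which is the contention-reduction point made on the slide.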
