Synchronization and Communication in the T3E Multiprocessor
Steven L. Scott
Cray Research, Inc
Presented by Hari Sivaramakrishnan

T3E Features
Distributed shared memory system
• Up to 2GB memory per processor
• DEC Alpha 21164 processor
• Shell: control and router chips, memory

T3E Features
Buffering
• Buffers can detect multiple interleaved streams
Local memory cached
• No onboard cache
• External backmap to maintain data consistency
E-Registers
• 512 user + 128 system
• Remote communication and synchronization
• Highly pipelined
• Extend the processor's physical address space

Global Communication
Operations performed on E-Registers
• Direct loads, stores between E-registers and processor registers
• Global operations (message passing, synchronization, remote loads)
Global references
• Global Virtual Address (GVA)

Address Translation
Global Virtual Address (GVA)
Virtual PE number
Centrifuge
• Mask, index, base
(Slide annotation: should be only 6 bits, not 8)
(Slide annotation: source or destination)

Get and Put operations
Reads and writes to global E-Registers
• Single word or a vector
Flags on each register for synchronization
• Empty
• Full
• Memory to memory copy through E-registers
Does not touch processor bus
• No RAW hazards
Highly pipelined
• 256 bytes in 26.7ns can be issued
• Large number of E-registers
• Max transfer rate = 480MB/s between two nodes

Atomic Memory Operations
T3D used dedicated SWAP registers
T3E uses memory locations
(Slide annotation: universal construct)

How to perform an AMO?
Operands written to E-registers
Store to I/O space to trigger operation
Atomic Memory Operation packet sent to particular memory location
Result returned to E-Register specified on the address line
Most AMOs need a read-modify-write of RAM
• 11 sysclocks (147ns in total)
• 8M operations per second
High bandwidth fetch_and_inc served out of buffer at memory controller for each node

Messages
T3D
• Single hardware message queue for user and system messages
• Every message generates an interrupt
T3E
• Arbitrary number of message queues
• Mapped to memory
• Queue max size = 128 MB. Message size = 64 bytes
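The AMO sequence listed under "How to perform an AMO?" (operands placed in E-registers, a store that triggers the operation, a result returned to a named E-register) can be mimicked with a toy software model. The sketch below is only an illustration under loud assumptions: `ToyNode`, `fetch_and_inc` and the parameter names are hypothetical, not a Cray interface, and the model ignores the network, the I/O-space store encoding and all timing.

```python
class ToyNode:
    """Toy model of one node: a few memory words plus a bank of E-registers."""

    def __init__(self, num_eregs=512, mem_words=16):
        self.memory = [0] * mem_words   # stands in for the node's local memory
        self.eregs = [0] * num_eregs    # 512 user E-registers, as on the slides

    def fetch_and_inc(self, target, addr, result_ereg):
        """Return the old value of target.memory[addr] and increment it.

        Hypothetical helper: the read-modify-write happens at the node that
        owns the word, and the old value is delivered to the requester's
        E-register named by `result_ereg`.
        """
        old = target.memory[addr]        # read
        target.memory[addr] = old + 1    # modify and write back
        self.eregs[result_ereg] = old    # result lands in the E-register
        return old


# Three requesters draw tickets from a counter that lives on `home`.
home = ToyNode()
requesters = [ToyNode() for _ in range(3)]
tickets = [n.fetch_and_inc(home, addr=0, result_ereg=5) for n in requesters]
```

Because every request is applied at the memory that owns the counter, each requester receives a distinct ticket, which is the property that makes fetch_and_inc useful for queues and locks.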
Message notification
• Always interrupt
• Never interrupt (polling)
• Interrupt on a threshold
Message passing and shared memory integration

Message Queue Control Word
Descriptor for a message queue
Messages rejected when queue is full
If message insertion creates a segmentation violation, nack is returned

Sending Messages
Messages assembled in an aligned block of 8 E-Registers
Sent to address of MQCW
MQCW updates and message storage are atomic
E-Registers status is set to empty on send
• If message accepted, changed to full
• If message rejected, changed to full-send-rejected

Barrier/Eureka Synchronization
Barrier
• Wait for all processors to signal an event
Eureka
• Wait for some processor to signal an event
Barrier/Eureka Synchronization units
• 32 BSUs
• Memory-mapped
• Set of processors given access to a BSU

Barrier/Eureka Trees
Barrier/Eureka network embedded in torus interconnect
• Keeps latency lower than a remote reference
Network router has a register for each BSU
• Node can be configured as internal in BSU tree
• Information about which of six network directions is the parent
Eurekas and Barrier notifications are sent to the
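The send protocol from the "Sending Messages" slide (an aligned block of 8 E-registers, status set to empty on send, then full or full-send-rejected) can be sketched as a small state machine. This is a minimal sketch under assumptions: `ToyQueue`, `send` and the `capacity` parameter are hypothetical names, the queue limit is counted in messages rather than bytes, and the status change is assumed to apply to every register in the block.

```python
from collections import deque

EMPTY, FULL, FULL_SEND_REJECTED = "empty", "full", "full-send-rejected"
MESSAGE_WORDS = 8  # 8 E-registers of 8 bytes each make one 64-byte message


class ToyQueue:
    """Toy stand-in for a memory-mapped message queue and its control word."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical limit, in messages
        self.messages = deque()

    def try_insert(self, words):
        # Models the atomic "update control word and store message" step.
        if len(self.messages) >= self.capacity:
            return False              # queue full: message rejected
        self.messages.append(tuple(words))
        return True


def send(eregs, status, base, queue):
    """Send the 8-register block starting at `base`; return its final status."""
    if base % MESSAGE_WORDS != 0:
        raise ValueError("message block must be aligned to 8 E-registers")
    block = range(base, base + MESSAGE_WORDS)
    for r in block:
        status[r] = EMPTY             # status is set to empty on send
    accepted = queue.try_insert(eregs[r] for r in block)
    final = FULL if accepted else FULL_SEND_REJECTED
    for r in block:
        status[r] = final
    return final


eregs = list(range(16))
status = [FULL] * 16
q = ToyQueue(capacity=1)
first = send(eregs, status, 0, q)    # accepted: the queue has room
second = send(eregs, status, 8, q)   # rejected: the queue is now full
```

A sender that polls the status of its E-registers can therefore tell an accepted message from a rejected one without taking an interrupt, and can retry the rejected block later.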
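The difference between a barrier (wait for all processors) and a eureka (wait for some processor) can be illustrated as two reductions over the same tree. The sketch below is a simplified software analogy, not the hardware mechanism: `TreeNode` and its methods are hypothetical names, leaves stand for processors, and internal nodes stand for routers configured as internal nodes of a BSU tree.

```python
class TreeNode:
    """Toy node in a barrier/eureka tree; leaves stand for processors."""

    def __init__(self, children=None):
        self.children = children or []
        self.signaled = False  # only meaningful for leaves

    def barrier_done(self):
        # Barrier: complete only when every processor below has signaled.
        if not self.children:
            return self.signaled
        return all(child.barrier_done() for child in self.children)

    def eureka_done(self):
        # Eureka: complete as soon as any processor below has signaled.
        if not self.children:
            return self.signaled
        return any(child.eureka_done() for child in self.children)


# Four processors under two internal nodes and one root.
leaves = [TreeNode() for _ in range(4)]
root = TreeNode([TreeNode(leaves[:2]), TreeNode(leaves[2:])])

leaves[0].signaled = True
after_one = (root.barrier_done(), root.eureka_done())

for leaf in leaves:
    leaf.signaled = True
after_all = (root.barrier_done(), root.eureka_done())
```

With one processor signaled the eureka has fired but the barrier has not; once all four have signaled, both conditions hold.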