Review: Where are We Now?
Multiprocessor Basics
Single Bus (Shared Address UMA) Multi’s
Multiprocessor Cache Coherency
Bus Snooping Protocols
Handling Writes
Slide 9
SMP Data Miss Rates
Block Size Effects
Other Coherence Protocols
Process Synchronization
Spin Lock Synchronization
Review: Summing Numbers on a SMP
An Example with 10 Processors
Barrier Implemented with Spin-Locks
Spin-Locks on Bus Connected ccUMAs
Commercial Single Backplane Multiprocessors
Summary

BusMultis.1
Review: Where are We Now?
[Figure: two computers, each with a processor (control and datapath), memory, input, and output]
- Multiprocessor: multiple processors with a single shared address space
- Cluster: multiple computers (each with their own address space) connected over a local area network (LAN), functioning as a single system

BusMultis.2
Multiprocessor Basics
                                            # of Proc
Communication model   Message passing       8 to 2048
                      Shared address NUMA   8 to 256
                      Shared address UMA    2 to 64
Physical connection   Network               8 to 256
                      Bus                   2 to 36
- Q1: How do they share data?
- Q2: How do they coordinate?
- Q3: How scalable is the architecture? How many processors?

BusMultis.3
Single Bus (Shared Address UMA) Multi’s
- Caches are used to reduce latency and to lower bus traffic
  - Write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization
[Figure: Proc1 through Proc4, each with its own caches, connected by a single bus to memory and I/O]

BusMultis.4
Multiprocessor Cache Coherency
- Cache coherency protocols
  - Bus snooping: cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don’t interfere with the processor’s access to the cache)
[Figure: Proc1 through ProcN, each with a snoop unit and a DCache, connected by a single bus to memory and I/O]

BusMultis.5
Bus Snooping Protocols
- Multiple copies are not a problem when reading
- Processor must have exclusive access to write a word
  - What happens if two processors try to write to the same shared data word in the same clock cycle?
The bus arbiter decides which processor gets the bus first (and this will be the processor with the first exclusive access). Then the second processor will get exclusive access. Thus, bus arbitration forces sequential behavior.
This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.
- All other processors sharing that data must be informed of writes

BusMultis.6
Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled two ways:
1. Write-update (write-broadcast): the writing processor broadcasts the new data over the bus, and all copies are updated
  - All writes go to the bus, so bus traffic is higher
  - Since new values appear in caches sooner, can reduce latency
2. Write-invalidate: the writing processor issues an invalidation signal on the bus; cache snoops check whether they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
  - Uses the bus only on the first write, so bus traffic is lower and bus bandwidth is better used

BusMultis.9
Write-Invalidate CC Examples
I = invalid (many), S = shared (many), M = modified (only one)

Proc 1: A in state S; Main Mem: A; Proc 2: A in state I
1. read miss for A
2. read request for A
3. snoop sees read request for A & lets MM supply A
4. gets A from MM & changes its state to S

Proc 1: A in state S; Main Mem: A; Proc 2: A in state I
1. write miss for A
2. writes A & changes its state to M
3. P2 sends invalidate for A
4. change A state to I

Proc 1: A in state M; Main Mem: A; Proc 2: A in state I
1. read miss for A
2. read request for A
3. snoop sees read request for A, writes-back A to MM
4. gets A from MM & changes its state to M
5. P2 sends invalidate for A
5. change A state to I

Proc 1: A in state M; Main Mem: A; Proc 2: A in state I
1. write miss for A
2. writes A & changes its state to M
3. P2 sends invalidate for A
4. change A state to I

BusMultis.10
SMP Data Miss Rates
- Shared data has lower spatial and temporal locality
  - Shared data misses often dominate cache behavior even though they may only be 10% to 40% of the data accesses
[Figure: two charts, FFT and Ocean, each showing capacity miss rate and coherence miss rate at 1, 2, 4, 8, and 16 on the horizontal axis]
64KB 2-way set associative data cache with 32B blocks
Hennessy & Patterson, Computer Architecture: A Quantitative Approach

BusMultis.11
Block Size Effects
- Writes to one word in a multi-word block mean
  - either the full block is invalidated (write-invalidate)
  - or the full block is exchanged between processors (write-update)
    - Alternatively, could broadcast only the written word
- Multi-word blocks can also result in false sharing: when two processors are writing to two different variables in the same cache block
  - With write-invalidate, false sharing increases cache miss rates
- Compilers can help reduce false sharing by allocating highly correlated data to the same cache block
[Figure: a 4-word cache block holding variables A and B, with Proc1 and Proc2 accessing it]

BusMultis.12
Other Coherence Protocols
- There are many variations on cache coherence protocols
- Another write-invalidate protocol, used in the Pentium 4 (and many other micro’s), is MESI with four states:
  - Modified: same as in the I/S/M protocol above
  - Exclusive: only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
    - Since there is only one copy of the block, write hits don’t need to send an invalidate signal
  - Shared: multiple copies of the shared data may be cached (i.e., data permitted to be cached with more than one processor); memory has an up-to-date copy
  - Invalid: same as in the I/S/M protocol above

BusMultis.14
Process Synchronization
- Need to be able to coordinate processes working on a common task
- Lock variables (semaphores) are used to coordinate or synchronize processes
- Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable
  - Single bus provides the arbitration mechanism, since the bus is the only path to memory: the processor that gets the bus wins
- Need an architecture-supported operation that locks the variable
  - Locking can be done via an atomic swap operation (the processor can both read a location and set it to the locked state, i.e. test-and-set, in the same bus operation)

BusMultis.15
Spin Lock Synchronization
[Flowchart, cut off in the source: Read lock variable; Try to lock variable using swap (read lock variable and set it to locked); Succeed? (=0?)]
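
The spin-lock steps above (read the lock variable, then try to lock it with an atomic swap and check whether the old value was 0) can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the slides' own code: Python has no hardware swap, so a hypothetical `SharedMemory` class uses a `threading.Lock` to stand in for the bus arbiter that makes each bus operation atomic. The names `SharedMemory`, `acquire`, `release`, and `run_workers`, the convention 0 = unlocked and 1 = locked, and the release-by-writing-0 step are all assumptions added for illustration, since the flowchart is cut off before the unlock step.

```python
import threading
import time


class SharedMemory:
    """Toy model of memory behind a single bus (hypothetical helper)."""

    def __init__(self):
        self._words = {}
        # Assumption: this lock plays the role of the bus arbiter,
        # so only one bus operation happens at a time.
        self._bus = threading.Lock()

    def read(self, addr):
        with self._bus:
            return self._words.get(addr, 0)

    def write(self, addr, value):
        with self._bus:
            self._words[addr] = value

    def swap(self, addr, new_value):
        """Return the old word and store new_value in one bus operation."""
        with self._bus:
            old = self._words.get(addr, 0)
            self._words[addr] = new_value
            return old


LOCK_ADDR = "lock"  # assumed convention: 0 = unlocked, 1 = locked


def acquire(mem):
    while True:
        # Spin on ordinary reads until the lock looks free.
        while mem.read(LOCK_ADDR) != 0:
            time.sleep(0)  # let another thread (e.g. the holder) run
        # Try to lock with swap; we succeeded only if the old value was 0.
        if mem.swap(LOCK_ADDR, 1) == 0:
            return


def release(mem):
    # Assumption: unlocking is a plain write of 0 to the lock variable.
    mem.write(LOCK_ADDR, 0)


def run_workers(num_workers=4, increments=200):
    """Each worker increments a shared counter inside the critical section."""
    mem = SharedMemory()
    total = {"value": 0}

    def worker():
        for _ in range(increments):
            acquire(mem)
            total["value"] += 1  # critical section
            release(mem)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]
```

Calling `run_workers(4, 200)` should return the full count of increments, because only the thread whose swap returned 0 enters the critical section. Spinning on plain reads before attempting the swap mirrors the flowchart's first step; on real hardware those reads would typically hit in the local cache, which this simulation does not model.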