Outline:
- Review: Where are We Now?
- Multiprocessor Basics
- Single Bus (Shared Address UMA) Multi’s
- Multiprocessor Cache Coherency
- Bus Snooping Protocols
- Handling Writes
- Slide 9 (Write-Invalidate CC Examples)
- SMP Data Miss Rates
- Block Size Effects
- Other Coherence Protocols
- Process Synchronization
- Spin Lock Synchronization
- Review: Summing Numbers on a SMP
- An Example with 10 Processors
- Barrier Implemented with Spin-Locks
- Spin-Locks on Bus Connected ccUMAs
- Commercial Single Backplane Multiprocessors
- Summary

BusMultis.1  Review: Where are We Now?

[Figure: two computers, each with a Processor (Control, Datapath), Memory, Input, and Output]

- Multiprocessor: multiple processors with a single shared address space
- Cluster: multiple computers (each with their own address space) connected over a local area network (LAN) functioning as a single system

BusMultis.2  Multiprocessor Basics

                                              # of Proc
Communication model    Message passing        8 to 2048
                       Shared address  NUMA   8 to 256
                                       UMA    2 to 64
Physical connection    Network                8 to 256
                       Bus                    2 to 36

- Q1: How do they share data?
- Q2: How do they coordinate?
- Q3: How scalable is the architecture? How many processors?

BusMultis.3  Single Bus (Shared Address UMA) Multi’s

- Caches are used to reduce latency and to lower bus traffic
  - Write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization

[Figure: Proc1, Proc2, Proc3, and Proc4, each with its own caches, connected by a single bus to Memory and I/O]

BusMultis.4  Multiprocessor Cache Coherency

- Cache coherency protocols
  - Bus snooping: cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don’t interfere with the processor’s access to the cache)

[Figure: Proc1, Proc2, ..., ProcN, each with a DCache and a Snoop unit, connected by a single bus to Memory and I/O]

BusMultis.5  Bus Snooping Protocols

- Multiple copies are not a problem when reading
- A processor must have exclusive access to write a word
  - What happens if two processors try to write to the same shared data word in the same clock cycle?
    The bus arbiter decides which processor gets the bus first (and this will be the processor with the first exclusive access). Then the second processor will get exclusive access. Thus, bus arbitration forces sequential behavior.
    This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.
- All other processors sharing that data must be informed of writes

BusMultis.6  Handling Writes

Ensuring that all other processors sharing data are informed of writes can be handled two ways:

1. Write-update (write-broadcast): the writing processor broadcasts the new data over the bus, and all copies are updated
   - All writes go to the bus → higher bus traffic
   - Since new values appear in caches sooner, can reduce latency
2. Write-invalidate: the writing processor issues an invalidation signal on the bus; cache snoops check to see if they have a copy of the data, and if so they invalidate their cache block containing the word (this allows multiple readers but only one writer)
   - Uses the bus only on the first write → lower bus traffic, so better use of bus bandwidth

BusMultis.9  Write-Invalidate CC Examples

I = invalid (many), S = shared (many), M = modified (only one)

[Figure: four examples, each showing Proc 1, Main Mem, and Proc 2 with a block A. The steps labeled in each example are listed below.]

Example 1 (Proc 1: A in S; Proc 2: A in I)
1. Proc 2: read miss for A
2. Proc 2: read request for A
3. Proc 1: snoop sees read request for A & lets MM supply A
4. Proc 2: gets A from MM & changes its state to S

Example 2 (Proc 1: A in S; Proc 2: A in I)
1. Proc 2: write miss for A
2. Proc 2: writes A & changes its state to M
3. P2 sends invalidate for A
4. Proc 1: change A state to I

Example 3 (Proc 1: A in M; Proc 2: A in I)
1. Proc 2: read miss for A
2. Proc 2: read request for A
3. Proc 1: snoop sees read request for A, writes-back A to MM
4. Proc 2: gets A from MM & changes its state to M
5. P2 sends invalidate for A
5. Proc 1: change A state to I

Example 4 (Proc 1: A in M; Proc 2: A in I)
1. Proc 2: write miss for A
2. Proc 2: writes A & changes its state to M
3. P2 sends invalidate for A
4. Proc 1: change A state to I

BusMultis.10  SMP Data Miss Rates

- Shared data has lower spatial and temporal locality
  - Shared data misses often dominate cache behavior even though they may only be 10% to 40% of the data accesses

[Figure: two charts, FFT and Ocean, each plotting capacity miss rate and coherence miss rate at x-axis values 1, 2, 4, 8, and 16. 64KB 2-way set associative data cache with 32B blocks. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach]

BusMultis.11  Block Size Effects

- Writes to one word in a multi-word block mean
  - either the full block is invalidated (write-invalidate)
  - or the full block is exchanged between processors (write-update)
    - Alternatively, could broadcast only the written word
- Multi-word blocks can also result in false sharing: when two processors are writing to two different variables in the same cache block
  - With write-invalidate, false sharing increases cache miss rates
- Compilers can help reduce false sharing by allocating highly correlated data to the same cache block

[Figure: a 4-word cache block holding variables A and B, with Proc1 accessing A and Proc2 accessing B]

BusMultis.12  Other Coherence Protocols

- There are many variations on cache coherence protocols
- Another write-invalidate protocol, used in the Pentium 4 (and many other microprocessors), is MESI with four states:
  - Modified: same (as M in the write-invalidate examples)
  - Exclusive: only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
    - Since there is only one copy of the block, write hits don’t need to send an invalidate signal
  - Shared: multiple copies of the shared data may be cached (i.e., data permitted to be cached with more than one processor); memory has an up-to-date copy
  - Invalid: same (as I in the write-invalidate examples)

BusMultis.14  Process Synchronization

- Need to be able to coordinate processes working on a common task
- Lock variables (semaphores) are used to coordinate or synchronize processes
- Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable
  - A single bus provides the arbitration mechanism, since the bus is the only path to memory: the processor that gets the bus wins
- Need an architecture-supported operation that locks the variable
  - Locking can be done via an atomic swap operation (the processor can both read a location and set it to the locked state, i.e. test-and-set, in the same bus operation)

BusMultis.15  Spin Lock Synchronization

[Flowchart, truncated in this preview]
- Read lock variable
- Succeed? (=0?)
- Try to lock variable using swap: read lock variable and set it to locked
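
The flowchart steps that survive in this preview (read the lock variable, test whether it is 0, then try to lock it with a swap) can be sketched in C11 as follows. This is a minimal sketch under stated assumptions, not the lecture's own code: the names `spinlock`, `spin_init`, `spin_try_lock`, `spin_lock`, and `spin_unlock` are hypothetical, C11 `atomic_exchange_explicit` stands in for the hardware atomic swap, and the memory orderings are an assumed choice.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = unlocked, 1 = locked,
   matching the "=0?" test in the flowchart. */
typedef struct {
    atomic_int locked;
} spinlock;

static void spin_init(spinlock *l) {
    atomic_init(&l->locked, 0);
}

/* One pass through the flowchart. Returns true if the lock was acquired. */
static bool spin_try_lock(spinlock *l) {
    /* Step 1: ordinary read of the lock variable. While another processor
       holds the lock, this read can be served from the local cached copy. */
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) {
        return false; /* lock is held: go back and read again */
    }
    /* Step 2: atomic swap. Reads the old value and sets the variable to
       locked in one indivisible operation. Only the processor that swaps
       out a 0 has acquired the lock. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is acquired. */
static void spin_lock(spinlock *l) {
    while (!spin_try_lock(l)) {
        /* busy-wait */
    }
}

/* Release the lock by writing 0 back to the lock variable. */
static void spin_unlock(spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Reading before swapping is a design choice in this sketch: on a write-invalidate bus, a waiting processor that only reads does not generate an invalidation on each attempt, whereas a swap always performs a write.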


WM CSCI 424 - Multiprocessor Basics
