Lecture 18: Large Caches, Multiprocessors

• Today: NUCA caches, multiprocessors (Sections 4.1-4.2)
• Reminder: assignment 5 due Thursday (don't procrastinate!)


Distributed Shared Cache

[Figure: eight tiles (Core 0 – Core 7), each with its own L1 D-cache, L1 I-cache, and one bank of the L2 cache, plus a memory controller for off-chip access.]

• A single tile is composed of a core, L1 caches, and a bank (slice) of the shared L2 cache
• The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations


Distributed Shared Cache

• The L2 (or L3) can be a large shared cache, but is physically partitioned into banks and distributed on chip
• Each core (tile) has one L2 cache bank adjacent to it
• One bank stores a subset of "sets" and all ways for that set
• OS-based first-touch page coloring can force a thread's pages to have physical page numbers that map to the thread's local L2 bank

[Figure: physical address layout – the page-color bits within the physical page number form part of the L2 cache index.]


UCA and NUCA

• The small-sized caches so far have all been uniform cache access: the latency for any access is a constant, no matter where data is found
• For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay: hence, non-uniform cache architecture
• The distributed shared cache is an example of a NUCA cache: variable latency to each bank


NUCA Design Space

• Distribute sets: Static-NUCA: each block has a unique location; easy to find data; page coloring for locality; page migration if the initial mapping is sub-optimal
• Distribute ways: Dynamic-NUCA: more flexibility in block placement; complicated search mechanisms; blocks migrate to be closer to their accessor
• Private data are easy to handle; shared data must be placed at the center-of-gravity of accesses


Prefetching

• Hardware prefetching can be employed for any of the cache levels
• It can introduce cache pollution – prefetched data is often placed in a separate prefetch buffer to avoid pollution – this buffer must be looked up in parallel with the cache access
• Aggressive prefetching increases "coverage", but leads to a reduction in "accuracy" → wasted memory bandwidth
• Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)


Stream Buffers

• Simplest form of prefetch: on every miss, bring in multiple cache lines
• When you read the top of the queue, bring in the next line

[Figure: an L1 cache backed by a stream buffer that holds sequential lines.]


Stride-Based Prefetching

• For each load, keep track of the last address accessed by the load and a possibly consistent stride
• An FSM detects a consistent stride and issues prefetches; a sketch of this table-plus-FSM organization follows below

[Figure: a PC-indexed table whose entries hold tag, prev_addr, stride, and state, driving an FSM with states init, transient (trans), steady, and no-pred; transitions are taken on "correct" and "incorrect (update stride)" outcomes.]
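To make the table-plus-FSM organization concrete, here is a minimal C sketch of a per-PC stride prefetcher. The entry layout (tag, prev_addr, stride, state) and the state names follow the slide, but the slide does not spell out every transition, so the rules below are one common formulation (in the style of reference prediction tables); the names rpt_access and RPT_ENTRIES and the demo access pattern in main are illustrative, not from the lecture.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Per-PC stride prefetcher sketch (one common formulation).
     * Each table entry records the last address seen by a load PC,
     * the last observed stride, and a small FSM state.            */

    enum state { INIT, TRANS, STEADY, NO_PRED };

    struct rpt_entry {
        uint64_t   tag;        /* PC of the load owning this entry   */
        uint64_t   prev_addr;  /* last address accessed by that load */
        int64_t    stride;     /* last recorded stride               */
        enum state state;
        bool       valid;
    };

    #define RPT_ENTRIES 64
    static struct rpt_entry rpt[RPT_ENTRIES];

    /* Called on every executed load; returns a prefetch address or 0. */
    static uint64_t rpt_access(uint64_t pc, uint64_t addr)
    {
        struct rpt_entry *e = &rpt[pc % RPT_ENTRIES];

        if (!e->valid || e->tag != pc) {       /* allocate a fresh entry */
            e->tag = pc; e->prev_addr = addr;
            e->stride = 0; e->state = INIT; e->valid = true;
            return 0;
        }

        int64_t new_stride = (int64_t)(addr - e->prev_addr);
        bool correct = (new_stride == e->stride);

        switch (e->state) {                    /* FSM from the slide */
        case INIT:
            if (correct) e->state = STEADY;
            else { e->state = TRANS;   e->stride = new_stride; } /* incorrect (update stride) */
            break;
        case TRANS:
            if (correct) e->state = STEADY;
            else { e->state = NO_PRED; e->stride = new_stride; }
            break;
        case STEADY:
            if (!correct) e->state = INIT;     /* stride kept on this edge */
            break;
        case NO_PRED:
            if (correct) e->state = TRANS;
            else e->stride = new_stride;       /* stay in no-pred */
            break;
        }
        e->prev_addr = addr;

        /* issue a prefetch only for a consistently observed stride */
        return (e->state == STEADY) ? addr + (uint64_t)e->stride : 0;
    }

    int main(void)
    {
        /* a load at PC 0x400 walking an array with a 64-byte stride */
        for (int i = 0; i < 6; i++) {
            uint64_t addr = 0x10000 + (uint64_t)i * 64;
            uint64_t pf   = rpt_access(0x400, addr);
            printf("access %#lx -> prefetch %#lx\n",
                   (unsigned long)addr, (unsigned long)pf);
        }
        return 0;
    }

The key design point is that a prefetch is issued only once the same stride has been confirmed (the steady state); this gives up a little coverage in exchange for better accuracy, the same tension described on the Prefetching slide.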
Taxonomy

• SISD: single instruction and single data stream: uniprocessor
• MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines
• SIMD: vector architectures: lower flexibility
• MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility


Memory Organization - I

• Centralized shared-memory multiprocessor or symmetric shared-memory multiprocessor (SMP)
• Multiple processors connected to a single centralized memory – since all processors see the same memory organization → uniform memory access (UMA)
• Shared-memory because all processors can access the entire memory address space
• Can centralized memory emerge as a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors


SMPs or Centralized Shared-Memory

[Figure: four processors, each with its own caches, connected to a single main memory and an I/O system.]


Memory Organization - II

• For higher scalability, memory is distributed among processors → distributed memory multiprocessors
• If one processor can directly address the memory local to another processor, the address space is shared → distributed shared-memory (DSM) multiprocessor
• If memories are strictly local, we need messages to communicate data → cluster of computers or multicomputers
• Non-uniform memory architecture (NUMA) since local memory has lower latency than remote memory


Distributed Memory Multiprocessors

[Figure: four nodes, each with a processor and caches, local memory, and I/O, connected by an interconnection network.]


Shared-Memory vs. Message-Passing

Shared-memory:
• Well-understood programming model
• Communication is implicit and hardware handles protection
• Hardware-controlled caching

Message-passing:
• No cache coherence → simpler hardware
• Explicit communication → easier for the programmer to restructure code
• Sender can initiate data transfer


Ocean Kernel

    Procedure Solve(A)
    begin
        diff = done = 0;
        while (!done) do
            diff = 0;
            for i ← 1 to n do
                for j ← 1 to n do
                    temp = A[i,j];
                    A[i,j] ← 0.2 * (A[i,j] + neighbors);
                    diff += abs(A[i,j] – temp);
                end for
            end for
            if (diff < TOL) then done = 1;
        end while
    end procedure


Shared Address Space Model

    int n, nprocs;
    float **A, diff;
    LOCKDEC(diff_lock);
    BARDEC(bar1);

    main()
    begin
        read(n); read(nprocs);
        A ← G_MALLOC();
        initialize(A);
        CREATE(nprocs, Solve, A);
        WAIT_FOR_END(nprocs);
    end main

    procedure Solve(A)
        int i, j, pid, done = 0;
        float temp, mydiff = 0;
        int mymin = 1 + (pid * n/nprocs);
        int mymax = mymin + n/nprocs - 1;
        while (!done) do
            mydiff = diff = 0;
            BARRIER(bar1, nprocs);
            for i ← mymin to mymax
                for j ← 1 to n do
                    …
                endfor
            endfor
            LOCK(diff_lock);
            diff += mydiff;
            UNLOCK(diff_lock);
            BARRIER(bar1, nprocs);
            if (diff < TOL) then done = 1;
            BARRIER(bar1, nprocs);
        endwhile


Message Passing Model

    main()
        read(n); read(nprocs);
        CREATE(nprocs-1, Solve);
        Solve();
        WAIT_FOR_END(nprocs-1);

    procedure Solve()
        int i, j, pid, nn = n/nprocs, done = 0;
        float temp, tempdiff, mydiff = 0;
        myA ← malloc(…);
        initialize(myA);
        while (!done) do
            mydiff = 0;
            if (pid != 0)
                SEND(&myA[1,0], n, pid-1, ROW);
            if (pid != nprocs-1)
                SEND(&myA[nn,0], n, pid+1, ROW);
            if (pid != 0)
                RECEIVE(&myA[0,0], n, pid-1, ROW);
            if (pid != nprocs-1)
                RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
            for i ← 1 to nn do
                for j ← 1 to n do
                    …
                endfor
            endfor
            if (pid != 0)
                SEND(mydiff, 1, 0, DIFF);
                RECEIVE(done, 1, 0, DONE);
            else
                for i ← 1 to nprocs-1 do
                    RECEIVE(tempdiff, 1, *, DIFF);
                    mydiff += tempdiff;
                endfor
                if (mydiff < TOL) done = 1;
                for i ← 1 to nprocs-1 do
                    SEND(done, 1, i, DONE);
                endfor
            endif
        endwhile
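The CREATE, LOCKDEC/LOCK/UNLOCK, and BARDEC/BARRIER names in the shared-address-space version above are textbook macros rather than a real API. As a point of reference, here is a minimal sketch of how the same diff reduction and convergence test could be written with POSIX threads, assuming a fixed thread count and leaving the grid sweep elided exactly as the slide does; the names solve, NPROCS, and TOL are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    /* One possible mapping of the slide's primitives onto POSIX threads:
     *   LOCKDEC/LOCK/UNLOCK -> pthread_mutex_t operations
     *   BARDEC/BARRIER      -> pthread_barrier_t operations
     *   CREATE/WAIT_FOR_END -> pthread_create / pthread_join            */

    #define NPROCS 4
    #define TOL    1e-3

    static double            diff;       /* shared accumulated difference */
    static pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t bar1;

    static void *solve(void *arg)
    {
        int pid  = (int)(long)arg;        /* thread id, as passed by CREATE    */
        int done = 0;                     /* local, like the slide's done flag */

        while (!done) {
            double mydiff = 0.0;
            diff = 0.0;                   /* every thread writes 0, as in the slide */
            pthread_barrier_wait(&bar1);

            /* ... sweep this thread's rows (mymin..mymax derived from pid),
             *     accumulating |new - old| into mydiff; elided as in the slide ... */
            (void)pid;

            pthread_mutex_lock(&diff_lock);     /* LOCK(diff_lock)   */
            diff += mydiff;
            pthread_mutex_unlock(&diff_lock);   /* UNLOCK(diff_lock) */

            pthread_barrier_wait(&bar1);        /* all partial diffs are in */
            if (diff < TOL)
                done = 1;
            pthread_barrier_wait(&bar1);        /* don't reset diff until everyone
                                                   has tested convergence        */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NPROCS];
        pthread_barrier_init(&bar1, NULL, NPROCS);   /* BARDEC(bar1) */

        for (long p = 0; p < NPROCS; p++)            /* CREATE(nprocs, Solve, A) */
            pthread_create(&tid[p], NULL, solve, (void *)p);
        for (int p = 0; p < NPROCS; p++)             /* WAIT_FOR_END(nprocs)     */
            pthread_join(tid[p], NULL);

        printf("converged: diff = %g\n", diff);
        pthread_barrier_destroy(&bar1);
        return 0;
    }

The three barriers per iteration mirror the slide: the first keeps any thread from accumulating into diff while another is still resetting it to zero, the second ensures every partial sum is in before the convergence test, and the third keeps a fast thread from starting the next iteration's reset while a slow one is still testing.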
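For the message-passing version, SEND and RECEIVE are likewise generic primitives. Below is a rough MPI rendering of the boundary-row exchange, under assumptions that are not in the slide: rows are stored in a flat array with two extra ghost rows, the grid sweep is elided, and the slide's explicit gather of partial diffs to process 0 followed by a broadcast of done is collapsed into a single MPI_Allreduce.

    #include <mpi.h>
    #include <stdlib.h>

    #define ROW_TAG 1
    #define TOL     1e-3

    int main(int argc, char **argv)
    {
        int pid, nprocs, n = 64;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int nn = n / nprocs;                          /* rows owned by this process */
        /* local rows 1..nn plus two ghost rows (0 and nn+1), row-major */
        double *myA = calloc((size_t)(nn + 2) * (size_t)n, sizeof *myA);

        int done = 0;
        while (!done) {
            double mydiff = 0.0, diff = 0.0;

            /* exchange boundary rows with the neighbours, as in the slide */
            if (pid != 0)
                MPI_Send(&myA[1 * n], n, MPI_DOUBLE, pid - 1, ROW_TAG, MPI_COMM_WORLD);
            if (pid != nprocs - 1)
                MPI_Send(&myA[nn * n], n, MPI_DOUBLE, pid + 1, ROW_TAG, MPI_COMM_WORLD);
            if (pid != 0)
                MPI_Recv(&myA[0 * n], n, MPI_DOUBLE, pid - 1, ROW_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (pid != nprocs - 1)
                MPI_Recv(&myA[(nn + 1) * n], n, MPI_DOUBLE, pid + 1, ROW_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* ... sweep local rows 1..nn, accumulating |new - old| into mydiff;
             *     elided as in the slide ... */

            /* sum all partial diffs and give every process the total */
            MPI_Allreduce(&mydiff, &diff, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            if (diff < TOL)
                done = 1;
        }

        free(myA);
        MPI_Finalize();
        return 0;
    }

Issuing blocking MPI_Send calls before the matching MPI_Recv calls mirrors the slide's structure, but for rows too large for eager delivery it can deadlock; MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) is the usual remedy.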