U of U CS 6810 - Large Caches, Multiprocessors

Lecture 18: Large Caches, Multiprocessors

• Today: NUCA caches, multiprocessors (Sections 4.1-4.2)
• Reminder: assignment 5 due Thursday (don't procrastinate!)

Distributed Shared Cache

[Figure: an eight-core tiled chip multiprocessor; each tile holds a core, its L1 I- and D-caches, and one bank (slice) of the shared L2 cache; a memory controller handles off-chip accesses. The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations.]

• The L2 (or L3) can be a large shared cache, but is physically partitioned into banks and distributed on chip
• Each core (tile) has one L2 cache bank adjacent to it
• One bank stores a subset of the sets, and all ways of those sets
• OS-based first-touch page coloring can force a thread's pages to have physical page numbers that map to the thread's local L2 bank (a sketch of this bank mapping appears after the prefetching slides below)

[Figure: the physical address split into physical page number and page offset; the low bits of the page number form the "color", which overlaps the L2 cache index, so a page's color selects its bank.]

UCA and NUCA

• The small-sized caches so far have all been uniform cache access: the latency for any access is a constant, no matter where data is found
• For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay: hence, non-uniform cache architecture (NUCA)
• The distributed shared cache is an example of a NUCA cache: variable latency to each bank

NUCA Design Space

• Distribute sets – Static-NUCA: each block has a unique location; easy to find data; page coloring for locality; page migration if the initial mapping is sub-optimal
• Distribute ways – Dynamic-NUCA: more flexibility in block placement; complicated search mechanisms; blocks migrate to be closer to their accessor
• Private data are easy to handle; shared data must be placed at the center of gravity of their accesses

Prefetching

• Hardware prefetching can be employed at any of the cache levels
• It can introduce cache pollution – prefetched data is often placed in a separate prefetch buffer to avoid pollution – this buffer must be looked up in parallel with the cache access
• Aggressive prefetching increases coverage (the fraction of misses eliminated) but reduces accuracy (the fraction of prefetches that are actually used) → wasted memory bandwidth
• Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)

Stream Buffers

• Simplest form of prefetch: on every miss, bring in multiple sequential cache lines
• When you read the top of the queue, bring in the next line (see the sketch below)

[Figure: an L1 cache backed by a stream buffer, a FIFO holding the next sequential lines.]

Stride-Based Prefetching

• For each load, keep track of the last address accessed by the load and a possibly consistent stride
• An FSM detects a consistent stride and issues prefetches (see the sketch below)

[Figure: a reference prediction table indexed by load PC, each entry holding (tag, prev_addr, stride, state), and a four-state FSM – init, transient (trans), steady, no-pred – that moves toward steady on correct stride predictions and updates the stride on incorrect ones.]
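As a concrete illustration of the page-coloring bank mapping on the Distributed Shared Cache slide above, here is a minimal C sketch. The specific parameters (4 KB pages, 8 banks, and the color sitting in the low bits of the physical page number) are illustrative assumptions, not values from the lecture.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                  /* assumed 4 KB pages         */
    #define BANK_BITS 3                   /* assumed 8 L2 banks (tiles) */

    /* The "color" is the low bits of the physical page number. Because
       these bits also fall inside the L2 set index, every line of a page
       lands in the same bank, so an OS that hands a thread pages of its
       own color keeps that thread's data in its local L2 slice.         */
    static unsigned l2_bank(uint64_t paddr)
    {
        return (unsigned)((paddr >> PAGE_BITS) & ((1u << BANK_BITS) - 1));
    }

    int main(void)
    {
        /* Physical page 0x12345 has color 5: the whole page maps to bank 5. */
        uint64_t paddr = (0x12345ULL << PAGE_BITS) | 0x7C0;
        printf("paddr 0x%llx -> L2 bank %u\n",
               (unsigned long long)paddr, l2_bank(paddr));
        return 0;
    }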
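A minimal software model of the FIFO on the Stream Buffers slide above: a miss (conceptually) fills the buffer with the next sequential lines, and consuming the head triggers a fetch of one more line. The buffer depth, line size, and the fetch_line hook are assumptions for illustration, not hardware described in the lecture.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64         /* assumed cache-line size     */
    #define DEPTH       4         /* assumed stream-buffer depth */

    typedef struct {
        uint64_t line[DEPTH];     /* line numbers, head at index 0 */
        int      count;
    } stream_buf;

    /* Hypothetical hook: in hardware this would start a memory fetch. */
    static void fetch_line(uint64_t line_addr) { (void)line_addr; }

    /* On an L1 miss, (re)fill the buffer with the next sequential lines. */
    static void sb_alloc(stream_buf *sb, uint64_t miss_addr)
    {
        uint64_t next = miss_addr / LINE_BYTES + 1;
        for (int i = 0; i < DEPTH; i++) {
            sb->line[i] = next + i;
            fetch_line((next + i) * LINE_BYTES);
        }
        sb->count = DEPTH;
    }

    /* Probe the buffer in parallel with the L1. A hit at the head pops it
       and prefetches one more sequential line to keep the FIFO full.     */
    static bool sb_lookup(stream_buf *sb, uint64_t addr)
    {
        if (sb->count == 0 || addr / LINE_BYTES != sb->line[0])
            return false;                   /* not the head: treat as miss */
        for (int i = 1; i < DEPTH; i++)     /* pop the head                */
            sb->line[i - 1] = sb->line[i];
        sb->line[DEPTH - 1] = sb->line[DEPTH - 2] + 1;
        fetch_line(sb->line[DEPTH - 1] * LINE_BYTES);
        return true;
    }

    int main(void)
    {
        stream_buf sb = { .count = 0 };
        sb_alloc(&sb, 0x1000);              /* miss fills lines 0x41..0x44 */
        printf("next sequential line hits: %d\n", sb_lookup(&sb, 0x1040));
        return 0;
    }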
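And a minimal C sketch of the reference prediction table and four-state FSM on the Stride-Based Prefetching slide above, following the classic scheme (correct predictions move toward steady; incorrect ones update the stride and fall back). The table size, the prefetch hook, and prefetching only in the steady state are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { INIT, TRANS, STEADY, NOPRED } pf_state;

    typedef struct {
        uint64_t tag;        /* load PC                       */
        uint64_t prev_addr;  /* last address this load issued */
        int64_t  stride;     /* last observed stride          */
        pf_state state;
    } rpt_entry;

    #define RPT_SIZE 64                     /* assumed table size */
    static rpt_entry rpt[RPT_SIZE];

    /* Hypothetical hook: in hardware this would enqueue a prefetch. */
    static void prefetch(uint64_t addr) { (void)addr; }

    /* Called on every executed load with its PC and data address. */
    static void rpt_access(uint64_t pc, uint64_t addr)
    {
        rpt_entry *e = &rpt[pc % RPT_SIZE];

        if (e->tag != pc) {                 /* new load: start in init */
            e->tag = pc; e->prev_addr = addr;
            e->stride = 0; e->state = INIT;
            return;
        }

        int64_t stride  = (int64_t)(addr - e->prev_addr);
        int     correct = (stride == e->stride);

        switch (e->state) {
        case INIT:   e->state = correct ? STEADY : TRANS;  break;
        case TRANS:  e->state = correct ? STEADY : NOPRED; break;
        case STEADY: if (!correct) e->state = INIT;        break;
        case NOPRED: if (correct)  e->state = TRANS;       break;
        }
        if (!correct) e->stride = stride;   /* incorrect: update stride */
        e->prev_addr = addr;

        if (e->state == STEADY)             /* confident: prefetch ahead */
            prefetch(addr + (uint64_t)e->stride);
    }

    int main(void)
    {
        /* A load at PC 0x400 streaming through an array with stride 8. */
        for (uint64_t a = 0x10000; a < 0x10040; a += 8)
            rpt_access(0x400, a);
        printf("state: %d (2 = steady)\n", rpt[0x400 % RPT_SIZE].state);
        return 0;
    }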
Taxonomy

• SISD: single instruction and single data stream: uniprocessor
• MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines
• SIMD: vector architectures: lower flexibility
• MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility

Memory Organization - I

• Centralized shared-memory multiprocessor, or symmetric shared-memory multiprocessor (SMP)
• Multiple processors connected to a single centralized memory – since all processors see the same memory organization → uniform memory access (UMA)
• Shared-memory because all processors can access the entire memory address space
• Can centralized memory emerge as a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors

SMPs or Centralized Shared-Memory

[Figure: four processors, each with its own caches, sharing a single main memory and I/O system.]

Memory Organization - II

• For higher scalability, memory is distributed among processors → distributed memory multiprocessors
• If one processor can directly address the memory local to another processor, the address space is shared → distributed shared-memory (DSM) multiprocessor
• If memories are strictly local, we need messages to communicate data → cluster of computers or multicomputers
• Non-uniform memory architecture (NUMA), since local memory has lower latency than remote memory

Distributed Memory Multiprocessors

[Figure: four nodes, each with a processor and caches, local memory, and I/O, connected by an interconnection network.]

Shared-Memory vs. Message-Passing

Shared-memory:
• Well-understood programming model
• Communication is implicit and the hardware handles protection
• Hardware-controlled caching

Message-passing:
• No cache coherence → simpler hardware
• Explicit communication → easier for the programmer to restructure code
• The sender can initiate data transfer

Ocean Kernel

    procedure Solve(A)
    begin
        diff = done = 0;
        while (!done) do
            diff = 0;
            for i ← 1 to n do
                for j ← 1 to n do
                    temp = A[i,j];
                    A[i,j] ← 0.2 * (A[i,j] + neighbors);
                    diff += abs(A[i,j] - temp);
                end for
            end for
            if (diff < TOL) then done = 1;
        end while
    end procedure

Shared Address Space Model

    int n, nprocs;
    float **A, diff;
    LOCKDEC(diff_lock);
    BARDEC(bar1);

    main()
    begin
        read(n); read(nprocs);
        A ← G_MALLOC();
        initialize(A);
        CREATE(nprocs, Solve, A);
        WAIT_FOR_END(nprocs);
    end main

    procedure Solve(A)
        int i, j, pid, done = 0;
        float temp, mydiff = 0;
        int mymin = 1 + (pid * n/nprocs);
        int mymax = mymin + n/nprocs - 1;
        while (!done) do
            mydiff = diff = 0;
            BARRIER(bar1, nprocs);
            for i ← mymin to mymax do
                for j ← 1 to n do
                    …
                endfor
            endfor
            LOCK(diff_lock);
            diff += mydiff;
            UNLOCK(diff_lock);
            BARRIER(bar1, nprocs);
            if (diff < TOL) then done = 1;
            BARRIER(bar1, nprocs);
        endwhile

Message Passing Model

    main()
        read(n); read(nprocs);
        CREATE(nprocs-1, Solve);
        Solve();
        WAIT_FOR_END(nprocs-1);

    procedure Solve()
        int i, j, pid, nn = n/nprocs, done = 0;
        float temp, tempdiff, mydiff = 0;
        myA ← malloc(…);
        initialize(myA);
        while (!done) do
            mydiff = 0;
            if (pid != 0)
                SEND(&myA[1,0], n, pid-1, ROW);
            if (pid != nprocs-1)
                SEND(&myA[nn,0], n, pid+1, ROW);
            if (pid != 0)
                RECEIVE(&myA[0,0], n, pid-1, ROW);
            if (pid != nprocs-1)
                RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
            for i ← 1 to nn do
                for j ← 1 to n do
                    …
                endfor
            endfor
            if (pid != 0)
                SEND(mydiff, 1, 0, DIFF);
                RECEIVE(done, 1, 0, DONE);
            else
                for i ← 1 to nprocs-1 do
                    RECEIVE(tempdiff, 1, *, DIFF);
                    mydiff += tempdiff;
                endfor
                if (mydiff < TOL) done = 1;
                for i ← 1 to nprocs-1 do
                    SEND(done, 1, i, DONE);
                endfor
            endif
        endwhile
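The LOCKDEC/BARDEC/CREATE primitives in the shared-address-space slide are pedagogical macros. As a point of comparison only, here is a minimal pthreads sketch of the same lock-plus-barrier pattern; the grid size, thread count, and the benign races it inherits from the slide (every thread resets diff and sets done to the same values) are assumptions and simplifications, not the lecture's code.

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdlib.h>
    #include <math.h>

    #define N      512                   /* assumed grid size    */
    #define NPROCS   4                   /* assumed thread count */
    #define TOL   1e-3f

    static float **A;
    static float  diff;
    static int    done;
    static pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t bar1;

    static void *solve(void *arg)
    {
        int pid   = (int)(long)arg;
        int mymin = 1 + pid * (N / NPROCS);   /* this thread's rows */
        int mymax = mymin + N / NPROCS - 1;

        while (!done) {
            float mydiff = 0;
            diff = 0;                /* every thread writes 0, as in the slide */
            pthread_barrier_wait(&bar1);

            for (int i = mymin; i <= mymax; i++)
                for (int j = 1; j <= N; j++) {
                    float temp = A[i][j];
                    A[i][j] = 0.2f * (A[i][j] + A[i-1][j] + A[i+1][j]
                                              + A[i][j-1] + A[i][j+1]);
                    mydiff += fabsf(A[i][j] - temp);
                }

            pthread_mutex_lock(&diff_lock);   /* serialized global reduction */
            diff += mydiff;
            pthread_mutex_unlock(&diff_lock);

            pthread_barrier_wait(&bar1);      /* all partial sums are in */
            if (diff < TOL) done = 1;         /* same verdict in every thread */
            pthread_barrier_wait(&bar1);
        }
        return NULL;
    }

    int main(void)
    {
        A = malloc((N + 2) * sizeof *A);      /* +2 for boundary rows/cols */
        for (int i = 0; i < N + 2; i++)
            A[i] = calloc(N + 2, sizeof **A);

        pthread_barrier_init(&bar1, NULL, NPROCS);
        pthread_t t[NPROCS];
        for (long p = 0; p < NPROCS; p++)
            pthread_create(&t[p], NULL, solve, (void *)p);
        for (int p = 0; p < NPROCS; p++)
            pthread_join(t[p], NULL);
        return 0;
    }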

