
CS 152 Computer Architecture and Engineering
Lecture 24: Multiprocessors
2005-11-22
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: David Marquardt and Udam Saini
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L24: Multiprocessors, UC Regents Fall 2005, UCB


Congratulations

All groups had all but one of the test programs running on their processor in hardware by midnight.
3 groups had all of the test programs running on their processor in hardware by midnight.
1 group passed checkoff on their first try on Friday in section.
More on the project later in lecture.


Last Time: Synchronization

[Figure: the queue before and after the producer runs, with Head and Tail pointers. Before: the queue holds y. After: the queue holds x and y. Arrows mark the direction of higher addresses.]

T1 code (producer):

    ORi  R1, R0, x        # Load x value into R1
    LW   R2, tail(R0)     # Load queue tail into R2
    SW   R1, 0(R2)        # Store x into queue
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, 0(tail)      # Update tail memory addr

T2 and T3 (2 copies of consumer thread):

          LW   R3, head(R0)     # Load queue head into R3
    spin: LW   R4, tail(R0)     # Load queue tail into R4
          BEQ  R4, R3, spin     # If queue empty, wait
          LW   R5, 0(R3)        # Read x from queue into R5
          ADDi R3, R3, 4        # Shift head by one word
          SW   R3, head(R0)     # Update head memory addr

Critical section: T2 and T3 must take turns running red code.


Today: Memory System Design

Multiprocessor memory systems: consequences of cache placement.
Write-through cache coherency: simple, but limited, approach to multiprocessor memory systems.
NUMA and Clusters: two different ways to build very large computers.


Two CPUs, two caches, shared DRAM

Write-through caches.

    CPU0: LW R2, 16(R0)
    CPU1: LW R2, 16(R0)
    CPU1: SW R0, 16(R0)

    Location              Addr   Value
    CPU0 cache            16     5
    CPU1 cache            16     5 -> 0
    Shared main memory    16     5 -> 0

View of memory no longer coherent. Loads of location 16 from CPU0 and CPU1 see different values!
Today: What to do.


The simplest solution: one cache

[Figure: CPU0 and CPU1 connect through a Memory Switch to a Shared Multi-Bank Cache, which connects to Shared Main Memory.]

CPUs do not have internal caches.
Only one cache, so different values for a memory address cannot appear in 2 caches.
Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.


Not a complete solution: good for L2

[Figure: CPU0 and CPU1 connect through a Memory Switch to a Shared Multi-Bank Cache, which connects to Shared Main Memory.]

For modern clock rates, access to shared cache through switch takes 10 cycles.
Using shared cache as the L1 data cache is tantamount to slowing down clock 10X for LWs. Not good.
This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.


Modified form: private L1s, shared L2

[Figure: CPU0 and CPU1, each with its own L1 Caches, connect through a Memory Switch or Bus to a Shared Multi-Bank L2 Cache, which connects to Shared Main Memory.]

Thus, we need to solve the cache coherency problem for L1 cache.
Advantages of shared L2 over private L2s:
    Processors communicate at cache speed, not DRAM speed.
    Constructive interference, if both CPUs need same data/instr.
Disadvantage: CPUs share BW to L2 cache.

[Figure 2 (HOT CHIPS 15): Power5 chip. FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit.]


Sequentially Consistent Memory Systems


Recall: Sequential Consistency

Sequential Consistency: as if each thread takes turns executing, and instructions in each thread execute in program order.

T1 code (producer):

    ORi  R1, R0, x        # Load x value into R1
    LW   R2, tail(R0)     # Load queue tail into R2
    SW   R1, 0(R2)        # (1) Store x into queue
    ADDi R2, R2, 4        # Shift tail by one word
    SW   R2, 0(tail)      # (2) Update tail memory addr

T2 code (consumer):

          LW   R3, head(R0)     # Load queue head into R3
    spin: LW   R4, tail(R0)     # (3) Load queue tail into R4
          BEQ  R4, R3, spin     # If queue empty, wait
          LW   R5, 0(R3)        # (4) Read x from queue into R5
          ADDi R3, R3, 4        # Shift head by one word
          SW   R3, head(R0)     # Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2, but not 2, 3, 1, 4!


Sequential consistency requirements

1. Only one processor at a time has write permission for a memory location.
   (The "sequential" part of sequential consistency.)
2. No processor can load a stale copy of a location after a write.
   (The "consistent" part of sequential consistency.)

    Location                   Addr   Value
    CPU0 cache                 16     5
    CPU1 cache                 16     5 -> 0
    Shared memory hierarchy    16     5 -> 0


Implementation: Snoopy Caches

[Figure: CPU0 and CPU1, each with a Cache and a Snooper, attached to a Memory bus that leads to the Shared Main Memory Hierarchy.]

Each cache has the ability to "snoop" on memory bus transactions of other CPUs.
The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs.


Writes from 10,000 feet

[Figure: CPU0 and CPU1, each with a Cache and a Snooper, attached to a Memory bus that leads to the Shared Main Memory Hierarchy.]

For write-thru caches:

1. Writing CPU takes control of bus.
2. Address to be written is invalidated in all other caches. Reads will no longer hit in cache and get stale data.
3. Write is sent to main memory. Reads will cache miss, retrieve new value from main memory.

To a first order, reads will "just work" if write-thru caches implement this policy.
A "two-state" protocol: cache lines are "valid" or "invalid".


Limitations of the write-thru approach

[Figure: CPU0 and CPU1, each with a Cache and a Snooper, attached to a Memory bus that leads to the Shared Main Memory Hierarchy.]

Every write goes to the bus. Total bus write bandwidth does not support more than 2 CPUs, in modern practice.
To scale further, we need to use write-back caches.
Write-back big trick: keep track of whether other caches also contain a cached line. If not, a cache has an "exclusive" on the line, and can read and write the line as if it were the only CPU.
For details, take CS 252.


Other Machine Architectures


NUMA: Non-uniform Memory Access

[Figure: CPU 0 through CPU 1023, each with its own Cache and DRAM, connected by an Interconnection Network.]

Each CPU has part of main memory attached to it.
To access other parts of main memory, use the interconnection network.
Good for applications that match the machine model.
For best results …
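
Returning to the three write steps on the "Writes from 10,000 feet" slide, the two-state write-thru policy can be sketched as a small Python simulation. This is only an illustration under stated assumptions, not the lecture's implementation: the class and method names (Bus, WriteThroughCache, snoop_invalidate, demo) are hypothetical, each cache line holds a single word, unwritten memory reads as 0, and the writing CPU is assumed to keep a valid copy of the value it just wrote. The demo replays the earlier two-CPU example at address 16.

```python
class Bus:
    """Shared memory bus: serializes writes and lets every cache snoop them."""

    def __init__(self):
        self.memory = {}   # main memory: address -> value
        self.caches = []   # every cache attached to (snooping on) this bus

    def read(self, addr):
        # Assumption: memory that was never written reads as 0.
        return self.memory.get(addr, 0)

    def write(self, writer, addr, value):
        # Step 1: the writing CPU controls the bus for this whole call.
        # Step 2: the address is invalidated in all other caches.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)
        # Step 3: the write is sent to main memory.
        self.memory[addr] = value


class WriteThroughCache:
    """Two-state cache: an address is valid (present in lines) or invalid."""

    def __init__(self, bus):
        self.lines = {}    # valid lines only: address -> value
        self.bus = bus
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:
            # Cache miss: retrieve the current value from main memory.
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.bus.write(self, addr, value)
        # Assumption: the writer keeps a valid, up-to-date copy.
        self.lines[addr] = value

    def snoop_invalidate(self, addr):
        # Another CPU wrote this address: drop our copy (valid -> invalid).
        self.lines.pop(addr, None)


def demo():
    bus = Bus()
    bus.memory[16] = 5
    cpu0, cpu1 = WriteThroughCache(bus), WriteThroughCache(bus)
    before = (cpu0.load(16), cpu1.load(16))   # both caches now hold 5
    cpu1.store(16, 0)                          # CPU1: SW R0, 16(R0)
    after = (cpu0.load(16), cpu1.load(16))     # CPU0 misses and refetches
    return before, after


if __name__ == "__main__":
    print(demo())   # prints ((5, 5), (0, 0))
```

Without the invalidation loop in Bus.write, CPU0 would still hit on its old line and read 5 after the store, which is the incoherent view from the "Two CPUs, two caches, shared DRAM" slide. Note also that every store calls Bus.write, which is the bus-bandwidth limitation that motivates write-back caches.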



Berkeley COMPSCI 152 - Lecture 24 – Multiprocessors
