Berkeley COMPSCI 152 - Lecture 24 – Multiprocessors



UC Regents Fall 2005 © UCB
CS 152 Computer Architecture and Engineering
Lecture 24 – Multiprocessors
2005-11-22
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
TAs: David Marquardt and Udam Saini

Congratulations!

All groups had all but one of the test programs running on their processor in hardware by midnight. 3 groups had all of the test programs running on their processor in hardware by midnight. 1 group passed checkoff on their first try on Friday in section. More on the project later in lecture ...

Last Time: Synchronization

T1 code (producer):

  ORi  R1, R0, x     ; Load x value into R1
  LW   R2, tail(R0)  ; Load queue tail into R2
  SW   R1, 0(R2)     ; Store x into queue
  ADDi R2, R2, 4     ; Shift tail by one word
  SW   R2, tail(R0)  ; Update tail memory addr

T2 & T3 code (2 copies of consumer thread):

        LW   R3, head(R0)  ; Load queue head into R3
  spin: LW   R4, tail(R0)  ; Load queue tail into R4
        BEQ  R4, R3, spin  ; If queue empty, wait
        LW   R5, 0(R3)     ; Read x from queue into R5
        ADDi R3, R3, 4     ; Shift head by one word
        SW   R3, head(R0)  ; Update head memory addr

[Figure: the queue in memory, growing toward higher addresses. Before: the queue holds x and y. After: the consumer has dequeued x, so head has advanced by one word and only y remains.]

Critical section: T2 and T3 must take turns running the red consumer code.

Today: Memory System Design

NUMA and clusters: two different ways to build very large computers.
Multiprocessor memory systems: consequences of cache placement.
Write-through cache coherency: a simple, but limited, approach to multiprocessor memory systems.

Two CPUs, two caches, shared DRAM ...

With write-through caches, suppose address 16 in shared main memory holds the value 5:

1. CPU0: LW R2, 16(R0). CPU0's cache now holds 5 for address 16.
2. CPU1: LW R2, 16(R0). CPU1's cache also holds 5 for address 16.
3. CPU1: SW R0, 16(R0). CPU1's cache and main memory now hold 0, but CPU0's cache still holds 5.

The view of memory is no longer "coherent": loads of location 16 from CPU0 and CPU1 see different values! Today: what to do ...
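The stale-read scenario above can be sketched with a toy model (a hypothetical `Cache` class, not from the lecture): each CPU gets a private write-through cache in front of one shared memory, and a store by one CPU updates memory but not the other CPU's cache.

```python
# Toy model of the slide's scenario: two private write-through caches
# in front of one shared memory, with no coherence protocol.

class Cache:
    def __init__(self, memory):
        self.memory = memory   # shared backing store
        self.lines = {}        # addr -> cached value

    def load(self, addr):
        # On a miss, fill from shared memory; on a hit, return the
        # (possibly stale) cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        # Write-through: update this cache and memory, but not any
        # other CPU's cache; that omission is the coherence bug.
        self.lines[addr] = value
        self.memory[addr] = value

memory = {16: 5}
cpu0, cpu1 = Cache(memory), Cache(memory)

print(cpu0.load(16))   # CPU0: LW R2, 16(R0) -> 5
print(cpu1.load(16))   # CPU1: LW R2, 16(R0) -> 5
cpu1.store(16, 0)      # CPU1: SW R0, 16(R0)
print(cpu0.load(16))   # CPU0 still sees 5: views have diverged
print(memory[16])      # main memory holds 0
```

Running the three accesses in the slide's order reproduces the incoherent outcome: after CPU1's store, CPU0's load of location 16 returns 5 while CPU1's returns 0.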
The simplest solution ... one cache!

The CPUs do not have internal caches. With only one cache, different values for a memory address cannot appear in two caches! A memory switch connects the CPUs to a shared multi-bank cache. Multiple cache banks support reads and writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.

Not a complete solution ... good for L2.

For modern clock rates, access to the shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing the clock down 10X for LWs. Not good. This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.

Modified form: private L1s, shared L2.

A memory switch or bus connects the CPUs' private L1 caches to a shared multi-bank L2 cache. Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data/instructions. Disadvantage: the CPUs share bandwidth to the L2 cache. Thus, we need to solve the cache coherency problem for the L1 caches.

[Excerpted Power5 article text, beginning mid-sentence:] ... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function.
These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance [5]. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm².

The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller.

We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4.
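The L2 geometry quoted above (three slices, 10-way set-associative, 512 congruence classes, 128-byte lines) can be checked arithmetically, and a toy address-to-slice/set mapping can be sketched. The actual Power5 slice-selection hash is not given in the excerpt; simple modulo interleaving on the line address is an assumption for illustration only.

```python
# Power5 L2 geometry from the text: 3 slices, each 10-way
# set-associative, 512 congruence classes (sets), 128-byte lines.
SLICES, WAYS, SETS, LINE = 3, 10, 512, 128

# Total capacity: 3 * 10 * 512 * 128 bytes = 1,920 KB = 1.875 MB,
# matching the "1.875-Mbyte (1,920-Kbyte)" figure in the text.
capacity_kb = SLICES * WAYS * SETS * LINE // 1024
print(capacity_kb)  # -> 1920

def l2_index(real_addr):
    """Map a real address to (slice, congruence class).
    Modulo interleaving on the line address is assumed here;
    the real Power5 mapping is not specified in the excerpt."""
    line_addr = real_addr // LINE
    return (line_addr % SLICES, (line_addr // SLICES) % SETS)

print(l2_index(0))    # line 0 -> slice 0, set 0
print(l2_index(512))  # line 4 -> slice 1, set 1
```

The point of the sketch is only that the real address alone determines the slice and congruence class; the way within the set is chosen by the replacement policy, not the address.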
The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram.

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share ...
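The fetch policy described above (SMT mode alternates the IF stage between the two threads' program counters; ST mode fetches for one thread every cycle) can be sketched as a toy scheduler. The function name and trace format are illustrative, not from the article.

```python
# Toy model of Power5 instruction fetch: in SMT mode the IF stage
# alternates between the two threads each cycle; in ST mode a single
# thread fetches every cycle. Each fetch brings up to 8 instructions.

FETCH_WIDTH = 8  # up to eight instructions per cycle, per the text

def fetch_trace(cycles, smt=True):
    """Return which thread (0 or 1) fetches on each cycle."""
    if smt:
        # Alternate between the two program counters: 0, 1, 0, 1, ...
        return [cycle % 2 for cycle in range(cycles)]
    # Single-threaded mode: thread 0 fetches every cycle.
    return [0] * cycles

print(fetch_trace(4, smt=True))   # [0, 1, 0, 1]
print(fetch_trace(4, smt=False))  # [0, 0, 0, 0]
```

The trade-off the model makes visible: in SMT mode each thread gets the fetch stage only every other cycle, while in ST mode one thread gets the full fetch bandwidth.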

