Berkeley COMPSCI 152 - Lecture 19 – Advanced Processors III



UC Regents Fall 2005 © UCB, CS 152 L19: Advanced Processors III
2005-11-3
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 19 – Advanced Processors III
www-inst.eecs.berkeley.edu/~cs152/
TAs: David Marquardt and Udam Saini

Last Time: Dynamic Scheduling

[Figure: reorder buffer feeding a store unit (to memory), a load unit (from memory), and ALU #1 and ALU #2. Each reorder-buffer line holds, for one instruction, its number and its physical <src1 #, src1 val, src2 #, src2 val, dest #, dest val> registers, and controls when the instruction executes.]

The execution engine works on the physical registers, not the architectural registers.
Common Data Bus: <reg #, reg val>

Today: Throughput and Multiple Threads

Goal: use multiple instruction streams to improve (1) the throughput of machines that run many programs and (2) the execution time of multi-threaded programs.
Difficulties: gaining the full advantage requires rewriting applications, the OS, and libraries.
Ultimate limiters: Amdahl's law (application dependent) and memory system performance.
Example: Sun Niagara (32 instruction streams on a chip).

Throughput Computing

Multithreading: interleave instructions from separate threads on the same hardware.
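The interleaving idea can be sketched in software. Below is a hypothetical Python model (not from the lecture; `issue_cycles` and its parameters are illustrative names) of round-robin issue from several threads into a classic 5-stage F/D/X/M/W pipe. It checks the property the lecture relies on: with enough threads interleaved, an instruction's writeback finishes before the same thread's next instruction reads the register file, so no bypass network is needed.

```python
# Hypothetical sketch: round-robin interleaving of N threads on a
# non-bypassed 5-stage pipeline. With 4 threads, an older instruction's
# writeback (W) always precedes the same thread's next decode (D),
# which is when the register file is read.

STAGES = ["F", "D", "X", "M", "W"]

def issue_cycles(num_threads, instrs_per_thread):
    """Return, per thread, the cycle at which each instruction
    occupies each pipeline stage under round-robin issue."""
    schedule = {t: [] for t in range(num_threads)}
    cycle = 0
    for _ in range(instrs_per_thread):
        for t in range(num_threads):
            # This instruction enters F at `cycle` and reaches
            # stage number k at cycle + k.
            schedule[t].append({s: cycle + k for k, s in enumerate(STAGES)})
            cycle += 1
    return schedule

sched = issue_cycles(num_threads=4, instrs_per_thread=2)
for t, instrs in sched.items():
    for older, younger in zip(instrs, instrs[1:]):
        # Writeback of the older instruction completes before the
        # next instruction from the same thread reads the regfile.
        assert older["W"] < younger["D"]
```

With fewer threads than pipe stages minus one, the assertion would fail, which is exactly why the slide interleaves four threads on this pipe.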
Seen by the OS as several CPUs.
Multi-core: integrate several processors that (partially) share a memory system on the same chip.

Multi-Threading (static pipelines)

Recall: the bypass network prevents stalls.

[Figure: the classic 5-stage pipeline (ID/Decode, EX, MEM, WB) with register file (rs1, rs2, ws, wd), ALU, data memory, and the bypass muxes and MemToReg path from WB.]

Instead of bypassing: interleave threads on the pipeline to prevent stalls ...

Multithreading (from Krste Asanović, MIT 6.823 Lecture 18, November 10, 2004)

How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                          t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
    T1: LW   r1, 0(r2)    F  D  X  M  W
    T2: ADD  r7, r1, r4      F  D  X  M  W
    T3: XORI r5, r4, #12        F  D  X  M  W
    T4: SW   0(r7), r5             F  D  X  M  W
    T1: LW   r5, 12(r1)               F  D  X  M  W

The last instruction in a thread always completes writeback before the next instruction in the same thread reads the register file.

Simple Multithreaded Pipeline

The thread select must be carried down the pipeline to ensure that the correct state bits are read and written at each pipe stage.

[Figure: pipeline with per-thread program counters and register files (PC1 ..., GPR1 ...), an instruction cache and data cache, and a thread-select signal steering the PC and register-file muxes.]

Introduced in 1964 by Seymour Cray: 4 CPUs, each run at 1/4 of the clock. Many variants followed ...

Multi-Threading (dynamic scheduling)

POWER4 (predates the POWER5 shown Tuesday)

- Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute.

- Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To simplify the logic, all groups following the store are flushed. If the offending load is in the same group as the store instruction, the group is flushed, and all instructions in the group form single-instruction groups.

- Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to that memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that a sequential load consistency problem exists. To simplify the logic, all groups following the older load instruction are flushed. If both load instructions are in the same group, the flush request is for the group itself. In this case, each instruction in the group, when refetched, forms a single-instruction group in order to avoid this situation the second time around.

Instruction execution pipeline

Figure 4 shows the POWER4 instruction execution pipeline for the various pipelines. The IF, IC, and BP cycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are the cycles during which instruction decode and group formation occur. The MP cycle is the mapper cycle, in which all dependencies are determined, resources are assigned, and the group is dispatched into the appropriate issue queues. During the ISS cycle, the IOP is issued to the appropriate execution unit and reads the appropriate [text cut off in the preview]

[Figure 4: POWER4 instruction execution pipeline. A common front end (IF, IC, BP for instruction fetch and branch prediction; D0-D3, Xfer, GD for instruction crack and group formation; MP for mapping) feeds per-pipeline out-of-order processing stages (ISS, RF, then EX for FX, EA/DC for LD/ST, or F6 for FP, followed by WB, Xfer, CP) for the BR, LD/ST, FX, and FP pipelines, with branch-redirect and interrupt/flush paths back to instruction fetch.]

(Source: IBM J. Res. & Dev.)
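The load-hit-store case above can be sketched in software. Below is a hypothetical Python model of the SRQ/SDQ search a load performs: `StoreEntry` and `resolve_load` are illustrative names, the real POWER4 logic is an age-tagged hardware CAM search, and the reject-versus-flush policy is simplified relative to the article's full description.

```python
# Hypothetical sketch of the POWER4 "load hit store" decision:
# forward from the SDQ, reject/reissue if the store's data is not
# yet in the SDQ, flush on a partial overlap, else read the cache.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    age: int               # program order; smaller = older
    addr: int              # starting byte address
    size: int              # bytes written
    data: Optional[bytes]  # None until the store's data reaches the SDQ

def resolve_load(load_age, load_addr, load_size, srq):
    """Scan the SRQ for older stores to the load's bytes and decide
    how the load obtains its data."""
    # Scan youngest-first so the most recent older store wins.
    for st in sorted(srq, key=lambda s: s.age, reverse=True):
        if st.age >= load_age:
            continue  # only older stores matter
        overlap = not (load_addr + load_size <= st.addr or
                       st.addr + st.size <= load_addr)
        if not overlap:
            continue
        if st.data is None:
            # Older store to the same location has not written the SDQ:
            # reject the load and reissue it later.
            return ("reject", None)
        contained = (st.addr <= load_addr and
                     load_addr + load_size <= st.addr + st.size)
        if contained:
            # Load data is contained within the store data: forward it.
            off = load_addr - st.addr
            return ("forward", st.data[off:off + load_size])
        # Partial overlap: cannot forward, so flush and refetch the group.
        return ("flush", None)
    return ("cache", None)  # no older matching store: read the cache
```

For example, a 2-byte load at address 102 that hits a pending 4-byte store at address 100 forwards the middle bytes of the store data, while a 4-byte load at 102 only partially overlaps that store and is flushed.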

