Berkeley COMPSCI 152 - Lecture 22 – Advanced Processors III

CS 152 Computer Architecture and Engineering
Lecture 22 – Advanced Processors III
2004-11-18
Dave Patterson (www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
UC Regents Fall 2004 © UCB

Last Time: Dynamic Scheduling
[Diagram: a reorder buffer whose entries hold <Inst #, src1 #, src1 val, src2 #, src2 val, dest #, dest val>, feeding a store unit (to memory), a load unit (from memory), ALU #1, and ALU #2.]
Each line holds the physical <src1, src2, dest> registers for an instruction, and controls when it executes.
The execution engine works on the physical registers, not the architecture registers.
Common Data Bus: <dest #, dest val>

Recall: Throughput and multiple threads
Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs, and (2) execution time of multi-threaded programs.
Difficulties: Gaining full advantage requires rewriting applications, OS, and libraries.
Ultimate limiters: Amdahl's law (application dependent); memory system performance.
Example: Sun Niagara (32 instruction streams on a chip).

This Time: Throughput Computing
Multithreading: Interleave instructions from separate threads on the same hardware. Seen by the OS as several CPUs.
Multi-core: Integrating several processors that (partially) share a memory system on the same chip.
Also: A "town meeting" discussion on lessons learned from Lab 4.

Multi-Threading

Power 4 (predates Power 5 shown Tuesday)
Single-threaded predecessor to Power 5: 8 execution units in the out-of-order engine, each may issue an instruction each cycle. The slide excerpts the IBM POWER4 paper (J. M. Tendler et al., IBM J. Res. & Dev., Vol. 46, No. 1, January 2002):
● Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute.
● Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To simplify the logic, all groups following the store are flushed. If the offending load is in the same group as the store instruction, the group is flushed, and all instructions in the group form single-instruction groups.
● Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that a sequential load consistency problem exists. To simplify the logic, all groups following the older load instruction are flushed. If both load instructions are in the same group, the flush request is for the group itself. In this case, each instruction in the group when refetched forms a single-instruction group in order to avoid this situation the second time around.
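To make the three hazard checks above concrete, here is a minimal Python sketch of the idea, not of the POWER4 hardware: it assumes word-granular addresses (so the partially overlapping load-hit-store case that forces a flush is folded away), models program order with a simple sequence number, and invents the names StoreQueueEntry, LoadQueueEntry, execute_load, and execute_store purely for illustration.

# Illustrative sketch of SRQ/SDQ/LRQ hazard checks (simplified; not POWER4 logic).
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreQueueEntry:            # one SRQ entry; its data field stands in for the SDQ
    seq: int                      # program order; smaller = older
    addr: int                     # word-granular address (a simplification)
    data: Optional[int] = None    # None until the store has written its data

@dataclass
class LoadQueueEntry:             # one LRQ entry
    seq: int
    addr: int
    executed: bool = False
    snooped: bool = False         # set when another processor writes this line

def execute_load(load, srq, lrq, memory):
    # Load hit store: find the youngest older store to the same address.
    older_stores = [s for s in srq if s.seq < load.seq and s.addr == load.addr]
    if older_stores:
        store = max(older_stores, key=lambda s: s.seq)
        if store.data is None:
            return ("reject_and_reissue", None)      # wait for the store to execute
        load.executed = True
        return ("forward_from_SDQ", store.data)
    # Load hit load: a younger load to the same address that already executed
    # and whose line was snooped violates sequential load consistency.
    for other in lrq:
        if other.seq > load.seq and other.addr == load.addr \
           and other.executed and other.snooped:
            return ("flush_group_and_younger", None)
    load.executed = True
    return ("load_from_cache", memory.get(load.addr, 0))

def execute_store(store, value, lrq):
    store.data = value                               # store writes its data
    # Store hit load: a younger load that already executed saw stale data.
    for load in lrq:
        if load.seq > store.seq and load.addr == store.addr and load.executed:
            return "flush_load_group_and_younger"
    return "ok"

Because addresses here are word-granular, the "cannot forward" outcome of load-hit-store (overlapping but not contained data, which the real machine resolves by flushing and refetching) never arises in this sketch; the remaining outcomes mirror the checks described in the bullets.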
Instruction execution pipeline
Figure 4 shows the POWER4 instruction execution pipeline for the various pipelines. The IF, IC, and BP cycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are the cycles during which instruction decode and group formation occur. The MP cycle is the mapper cycle, in which all dependencies are determined, resources assigned, and the group dispatched into the appropriate issue queues. During the ISS cycle, the IOP is issued to the appropriate execution unit, reads the appropriate [...]
[Figure 4: POWER4 instruction execution pipeline. A common front end covers instruction fetch and branch prediction (IF, IC, BP, with branch redirects) and instruction crack and group formation (D0, D1, D2, D3, Xfer, GD); out-of-order processing then proceeds per pipeline — BR: MP ISS RF EX WB Xfer; LD/ST: MP ISS RF EA DC Fmt WB Xfer; FX: MP ISS RF EX WB Xfer; FP: MP ISS RF F6 WB Xfer — ending at CP (completion); the figure also marks where interrupts and flushes are handled.]

For most apps, most execution units lie idle
[Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor, plotted as percent of total issue cycles for the applications alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, and a composite. "Processor busy" represents the utilized issue slots; the other categories (itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short integer, long integer, short fp, long fp, memory conflict) represent wasted issue slots.]
[...] such as an I tlb miss and an I cache miss, the wasted cycles are divided up appropriately. Table 3 specifies all possible sources of wasted cycles in our model, and some of the latency-hiding or latency-reducing techniques that might apply to them. Previous work [32, 5, 18], in contrast, quantified some of these same effects by removing barriers to parallelism and measuring the resulting increases in performance. Our results, shown in Figure 2, demonstrate that the [...]
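To make that issue-slot accounting concrete, here is a toy Python sketch (an illustration, not the methodology of the quoted study): it charges each of an 8-issue machine's slots in a cycle either to "processor busy" or to a single waste cause, whereas the study divides multi-cause cycles among the causes; the 4-cycle trace and the name issue_slot_breakdown are invented for this example.

# Toy issue-slot accounting in the spirit of the figure above (illustrative only).
from collections import Counter

ISSUE_WIDTH = 8   # slots available per cycle on an 8-issue machine

def issue_slot_breakdown(cycles):
    """cycles: list of (instructions_issued, waste_cause) pairs, one per cycle.
    Unused slots in a cycle are all charged to that cycle's single cause."""
    slots = Counter()
    for issued, cause in cycles:
        slots["processor busy"] += issued
        slots[cause] += ISSUE_WIDTH - issued
    total_slots = ISSUE_WIDTH * len(cycles)
    return {k: 100.0 * v / total_slots for k, v in slots.items() if v}

# A made-up 4-cycle trace: a fully idle cycle (vertical waste) and partially
# filled cycles (horizontal waste) both show up in the breakdown.
trace = [
    (3, "dcache miss"),
    (0, "icache miss"),           # nothing issues this cycle
    (5, "branch misprediction"),  # 3 of 8 slots go unused
    (8, "none"),                  # a fully utilized cycle
]
print(issue_slot_breakdown(trace))
# -> {'processor busy': 50.0, 'dcache miss': 15.625,
#     'icache miss': 25.0, 'branch misprediction': 9.375}

On this made-up trace the machine keeps only half of its issue slots busy; the slide's figure reports the same kind of breakdown, measured per benchmark, to argue that most execution units lie idle for most applications.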

