Berkeley COMPSCI 152 - Lecture 19 – Advanced Processors III


CS 152 Computer Architecture and Engineering
Lecture 19 – Advanced Processors III

2006-11-2
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
TAs: Udam Saini and Jue Sun

CS 152 L19: Advanced Processors III, UC Regents Fall 2006 © UCB


Last Time: Dynamic Scheduling (CS 152 L18: Advanced Processors II)

The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.[6,7] The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.[7] If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle’s eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction’s address and offset value. For

[Excerpt footer: page 43, March–April 2004]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure labels: instruction fetch (IF, IC, BP); group formation and instruction decode (D0, D1, D2, D3, Xfer, GD); out-of-order processing with branch, load/store, fixed-point, and floating-point pipelines; branch redirects; interrupts and flushes.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
[Figure legend: shared by two threads; thread 0 resources; thread 1 resources.]
[Figure labels: program counter; alternate; branch prediction (branch history tables, return stack, target cache); instruction cache; instruction translation; instruction buffer 0; instruction buffer 1; thread priority; dynamic instruction selection; group formation, instruction decode, dispatch; shared-register mappers; shared issue queues; read shared-register files; shared execution units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL); write shared-register files; group completion; store queue; data cache; data translation; L2 cache.]

Fetch up to 8 instructions per cycle.
Dispatch up to 5 instructions per cycle.
240 physical registers (120 int + 120 FP).
Execute up to 8 instructions per cycle.
A thread may commit up to 5 instructions per cycle.
Up to 200 instructions “in flight”.


Today: Throughput and multiple threads

Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs (2) multi-threaded program execution time.
Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.
Ultimate limiter: Amdahl’s law (application dependent). Memory system performance.
Example: Sun Niagara (32 instruction streams on a chip).


Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs.
Multi-core: Integrating several processors that (partially) share a memory system on the same chip.


Multi-Threading (Static Pipelines)


Recall: Bypass network prevents stalls

[Figure: pipelined datapath with ID (Decode), EX, MEM, and WB stages; RegFile, ALU, Data Memory, and mux/logic bypass paths, including one from WB.]

Instead of bypass: Interleave threads on the pipeline to prevent stalls ...


Multithreading [Krste, November 10, 2004, 6.823, L18--3]

How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on same pipeline

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                        t0  t1  t2  t3  t4  t5  t6  t7  t8  t9
T1: LW   r1, 0(r2)      F   D   X   M   W
T2: ADD  r7, r1, r4         F   D   X   M   W
T3: XORI r5, r4, #12            F   D   X   M   W
T4: SW   0(r7), r5                  F   D   X   M   W
T1: LW   r5, 12(r1)                     F   D   X   M   W

Last instruction in a thread always completes writeback before next instruction in same thread reads regfile


Simple Multithreaded Pipeline [Krste, November 10, 2004, 6.823, L18--5]

Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage

[Figure: four PCs and four GPR files chosen by a thread select signal; I$, IR, X, Y, D$.]

Introduced in 1964 by Seymour Cray
4 CPUs, each run at 1/4 clock
Many variants ...


Multi-Threading (Dynamic Scheduling)


Power 4 (predates Power 5 shown Tuesday)

● Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute.

● Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To
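
The load-hit-store checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Power4 logic: the Store record, the classify_load function, and the four outcome strings are hypothetical names, a single list of older stores stands in for the SRQ/SDQ pair, and the order of the checks (youngest overlapping older store first, data readiness before containment) is an assumption the text does not spell out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Store:
    """Hypothetical record for one older store tracked by the store queue."""
    addr: int          # first byte written
    size: int          # number of bytes written
    data_ready: bool   # True once the store data has been written to the SDQ


def overlaps(a_addr: int, a_size: int, b_addr: int, b_size: int) -> bool:
    """True if the two byte ranges share at least one byte."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


def contains(outer_addr: int, outer_size: int,
             inner_addr: int, inner_size: int) -> bool:
    """True if the inner byte range lies entirely inside the outer one."""
    return (outer_addr <= inner_addr
            and inner_addr + inner_size <= outer_addr + outer_size)


def classify_load(load_addr: int, load_size: int,
                  older_stores: Sequence[Store]) -> str:
    """Decide what happens to a load given older stores (oldest first).

    Returns one of:
      "cache"   - no older store touches these bytes; read the cache
      "forward" - store data is in the SDQ and covers the load
      "flush"   - ranges overlap but the load is not contained in the store
      "reject"  - a matching older store has not produced its data yet
    """
    # Assumption: the youngest overlapping older store decides the outcome.
    for store in reversed(older_stores):
        if not overlaps(load_addr, load_size, store.addr, store.size):
            continue
        if not store.data_ready:
            return "reject"
        if contains(store.addr, store.size, load_addr, load_size):
            return "forward"
        return "flush"
    return "cache"


if __name__ == "__main__":
    stores = [Store(addr=0x100, size=8, data_ready=True)]
    print(classify_load(0x100, 4, stores))  # load contained in the store
    print(classify_load(0x104, 8, stores))  # partial overlap
```

A real load/store unit would run these comparisons in parallel against every queue entry in hardware; the sequential loop here only makes the decision order easy to read.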

