Berkeley COMPSCI 152 - Lecture 16 Dynamic Scheduling: Scoreboards and Tomasulo

CS152 Computer Architecture and Engineering
Lecture 16: Dynamic Scheduling (Scoreboards and Tomasulo)
April 2, 2003
John Kubiatowicz (www.cs.berkeley.edu/~kubitron)
Lecture slides: http://inst.eecs.berkeley.edu/~cs152/
4/02/03, UCB Spring 2003, CS152 / Kubiatowicz


The Big Picture: Where are We Now?

The Five Classic Components of a Computer:
  - Processor (Control, Datapath)
  - Memory
  - Input
  - Output

Today's Topics:
  - Recap last lecture
  - Review Scoreboard
  - Administrivia
  - Tomasulo scheduling algorithm
  - Tomasulo loop unrolling


Recall: Compiler techniques for parallelism

  - Loop unrolling: multiple iterations of loop in software
      - Amortizes loop overhead over several iterations
      - Gives more opportunity for scheduling around stalls
  - Software pipelining: take one instruction from each of several iterations of the loop
      - Software overlapping of loop iterations
      - Today: will show hardware overlapping of loop iterations
  - Very Long Instruction Word machines (VLIW): multiple operations coded in single, long instruction
      - Requires sophisticated compiler to decide which operations can be done in parallel
      - Trace scheduling: find common path and schedule code as if branches didn't exist (+ add "fixup code")
  - All of these require additional registers


Recall: Can we somehow make CPI closer to 1?

Let's assume full pipelining: if we have a 4-cycle instruction, then we need 3 instructions between a producing instruction and its use:

    multf  F0,F2,F4
    delay-1
    delay-2
    delay-3
    addf   F6,F10,F0

[Figure: pipeline with stages Fetch, Decode, Ex1, Ex2, Ex3, Ex4, WB, occupied by addf, delay3, delay2, delay1, multf; arrows mark the earliest forwarding for 4-cycle instructions and the earliest forwarding for 1-cycle instructions.]


Recall: Revised FP Loop Minimizing Stalls

    1 Loop: LD    F0,0(R1)
    2       stall
    3       ADDD  F4,F0,F2
    4       SUBI  R1,R1,8
    5       BNEZ  R1,Loop    ; delayed branch
    6       SD    8(R1),F4   ; altered when move past SUBI

Swap BNEZ and SD by changing address of SD.

    Instruction         Instruction          Execute latency    Use latency
    producing result    using result         in cycles          in cycles
    FP ALU op           Another FP ALU op    4                  3
    FP ALU op           Store double         4                  2
    Load double         FP ALU op            2                  1

6 clocks (CPI = 6/5 = 1.2): unroll loop 4 times to make the code faster?


Recall: Unrolled Loop That Minimizes Stalls

    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration (CPI = 14/14 = 1).

What assumptions made when moved code?
  - OK to move store past SUBI even though changes register
  - OK to move loads before stores: get right data?
  - When is it safe for compiler to do such changes?

When safe to move instructions?


Recall: Software Pipelining Example

Before: Unrolled 3 times

    1   LD    F0,0(R1)
    2   ADDD  F4,F0,F2
    3   SD    0(R1),F4
    4   LD    F6,-8(R1)
    5   ADDD  F8,F6,F2
    6   SD    -8(R1),F8
    7   LD    F10,-16(R1)
    8   ADDD  F12,F10,F2
    9   SD    -16(R1),F12
    10  SUBI  R1,R1,24
    11  BNEZ  R1,LOOP

After: Software Pipelined

    1   SD    0(R1),F4     ; Stores M[i]
    2   ADDD  F4,F0,F2     ; Adds to M[i-1]
    3   LD    F0,-16(R1)   ; Loads M[i-2]
    4   SUBI  R1,R1,8
    5   BNEZ  R1,LOOP

Symbolic Loop Unrolling:
  - Maximize result-use distance
  - Less code space than unrolling
  - Fill and drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling

[Figure: overlapped ops plotted against time, one curve for the SW pipeline and one for the unrolled loop.]


Recall: Software Pipelining with Loop Unrolling in VLIW

    Memory           Memory          FP                 FP      Int. op /
    reference 1      reference 2     operation 1        op. 2   branch           Clock
    LD F0,-48(R1)    ST 0(R1),F4     ADDD F4,F0,F2                               1
    LD F6,-56(R1)    ST -8(R1),F8    ADDD F8,F6,F2              SUBI R1,R1,24    2
    LD F10,-40(R1)   ST 8(R1),F12    ADDD F12,F10,F2            BNEZ R1,LOOP     3

Software pipelined across 9 iterations of original loop!
In each iteration of above loop, we:
  - Store to m, m-8, m-16 (iterations I-3, I-2, I-1)
  - Compute for m-24, m-32, m-40 (iterations I, I+1, I+2)
  - Load from m-48, m-56, m-64 (iterations I+3, I+4, I+5)

9 results in 9 cycles, or 1 clock per iteration.
Average: 3.3 ops per clock, 66% efficiency.
Note: Need less registers for software pipelining (only using 7 registers here, was using 15).


Recall: Can we use HW to get CPI closer to 1?

Why in HW at run time?
  - Works when can't know real dependence at compile time
  - Compiler simpler
  - Code for one machine runs well on another

Key idea: allow instructions behind stall to proceed:

    DIVD  F0,F2,F4
    ADDD  F10,F0,F8
    SUBD  F12,F8,F14

Out-of-order execution => out-of-order completion.

Disadvantages?
  - Complexity
  - Precise interrupts harder! (Talk about this next time)


Scoreboard: a bookkeeping technique

Out-of-order execution divides ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands

Scoreboards date to CDC6600 in 1963.
Instructions execute whenever not dependent on previous instructions and no hazards.
CDC 6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
  - No forwarding!
  - Imprecise interrupt/exception model for now


Scoreboard Architecture (CDC 6600)

[Figure: Registers connected to the functional units (FP Mult, FP Mult, FP Divide, FP Add, Integer) and to Memory, with the SCOREBOARD controlling both the registers and the functional units.]


Scoreboard Implications

  - Out-of-order completion => WAR, WAW hazards?
  - Solutions for WAR:
      - Stall writeback until registers have been read
      - Read registers only during Read Operands stage
  - Solution for WAW:
      - Detect hazard and stall issue of new instruction until other instruction completes
  - Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
  - Scoreboard keeps track of dependencies between instructions that have already issued
  - Scoreboard replaces ID, EX, WB with 4 stages


Four Stages of Scoreboard Control

  - Issue: decode instructions and check for structural hazards (ID1)
      - Instructions issued in program order (for hazard checking)
      - Don't issue if structural hazard
      - Don't issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
  - Read operands: wait until no data hazards, then read operands (ID2)
      - All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data
      - No forwarding …
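
The Issue (ID1) and Read operands (ID2) checks described above can be sketched in a few lines of Python. This is only an illustrative model, not the CDC 6600 logic: the class and field names (Instr, Scoreboard, busy_units, pending_writes, waiting_on) and the unit names "divide", "add" and "mult" are assumptions made for the example, each functional unit holds at most one instruction, and WAR handling at write-back is left out. The first three instructions are the DIVD/ADDD/SUBD sequence from the earlier slide; the MULTD is a hypothetical instruction added to show a WAW stall.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: str                # destination register
    srcs: Tuple[str, ...]    # source registers
    unit: str                # functional unit the instruction needs (assumed name)


@dataclass
class Scoreboard:
    busy_units: Set[str] = field(default_factory=set)
    # register -> unit of the issued, uncompleted instruction that will write it
    pending_writes: Dict[str, str] = field(default_factory=dict)
    # unit -> producer units whose results its instruction still waits for
    waiting_on: Dict[str, Set[str]] = field(default_factory=dict)

    def can_issue(self, ins: Instr) -> bool:
        """ID1: no structural hazard and no WAW hazard."""
        if ins.unit in self.busy_units:        # structural hazard
            return False
        if ins.dest in self.pending_writes:    # output dependence (WAW)
            return False
        return True

    def issue(self, ins: Instr) -> bool:
        if not self.can_issue(ins):
            return False
        # Record which earlier instructions produce our sources (RAW),
        # before registering our own pending write.
        self.waiting_on[ins.unit] = {
            self.pending_writes[s] for s in ins.srcs if s in self.pending_writes
        }
        self.busy_units.add(ins.unit)
        self.pending_writes[ins.dest] = ins.unit
        return True

    def can_read_operands(self, ins: Instr) -> bool:
        """ID2: issued, and every producer of a source has written back."""
        return ins.unit in self.waiting_on and not self.waiting_on[ins.unit]

    def write_back(self, ins: Instr) -> None:
        """Free the unit and wake up instructions waiting on its result."""
        self.busy_units.discard(ins.unit)
        self.pending_writes.pop(ins.dest, None)
        self.waiting_on.pop(ins.unit, None)
        for producers in self.waiting_on.values():
            producers.discard(ins.unit)


sb = Scoreboard()
divd = Instr("DIVD", "F0", ("F2", "F4"), "divide")
addd = Instr("ADDD", "F10", ("F0", "F8"), "add")
subd = Instr("SUBD", "F12", ("F8", "F14"), "add")
multd = Instr("MULTD", "F10", ("F2", "F4"), "mult")  # hypothetical, for WAW

sb.issue(divd)
sb.issue(addd)
print(sb.can_read_operands(addd))  # False: RAW on F0, DIVD has not written back
print(sb.can_issue(subd))          # False: the single "add" unit is busy
print(sb.can_issue(multd))         # False: WAW on F10 with the pending ADDD
sb.write_back(divd)
print(sb.can_read_operands(addd))  # True: F0 is now available
```

With a single adder, SUBD stalls at issue behind ADDD even though its operands are ready, which is the structural-hazard rule of the Issue stage; with a second adder in the model it would issue and read operands while ADDD waits for DIVD.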

