UMD CMSC 411 - Lecture 10 Instruction Level Parallelism 4

CMSC 411 Computer Systems Architecture
Lecture 10: Instruction Level Parallelism 4 (Dynamic Scheduling, Tomasulo Algorithm)
Slides from Patterson (CS252 S05)

Outline
  - ILP
  - Compiler techniques to increase ILP
  - Loop Unrolling
  - Static Branch Prediction
  - Dynamic Branch Prediction
  - Overcoming Data Hazards with Dynamic Scheduling
  - (Start) Tomasulo Algorithm
  - Conclusion

Advantages of Dynamic Scheduling
  - Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
  - Handles cases when dependences are unknown at compile time
      - it allows the processor to tolerate unpredictable delays such as cache misses by executing other code while waiting for the miss to resolve
  - Allows code compiled for one pipeline to run efficiently on a different pipeline
  - Simplifies the compiler
  - Leads to hardware speculation, a technique with significant performance advantages (discussed later)

HW Schemes: Instruction Parallelism
  - Key idea: allow instructions behind a stall to proceed
        DIVD  F0,  F2, F4
        ADDD  F10, F0, F8
        SUBD  F12, F8, F14
  - Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
  - In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
  - Will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
  - Note: dynamic execution creates WAR and WAW hazards and makes handling exceptions harder

Dynamic Scheduling Step 1
  - Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
  - Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
      - Issue: decode instructions, check for structural hazards
      - Read operands: wait until no data hazards, then read operands

A Dynamic Algorithm: Tomasulo's
  - For IBM 360/91 (before caches): long memory latency
  - Goal: high performance without special compilers
  - Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
      - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware
  - Why study a 1966 computer?
  - The descendants of this have flourished: Alpha 21264, Pentium 4, AMD Opteron, Power 5

Tomasulo Algorithm
  - Control and buffers distributed with Function Units (FU)
      - FU buffers called "reservation stations"; have pending operands
  - Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
      - Renaming avoids WAR, WAW hazards
      - More reservation stations than registers, so can do optimizations compilers can't
  - Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
      - Avoids RAW hazards by executing an instruction only when its operands are available
  - Load and Stores treated as FUs with RSs as well
  - Integer instructions can go past branches (use branch prediction); also allow FP ops beyond basic block in FP queue

Tomasulo Organization (from H&P Figure 2.9)
  [Figure: FP Op Queue and FP Registers feed the reservation stations; Load Buffers (Load1 to Load6) bring data from memory; Store Buffers send data to memory; reservation stations Add1 to Add3 feed the FP adders and Mult1, Mult2 feed the FP multipliers; all results return over the Common Data Bus (CDB).]

Reservation Station Components
  - Op: operation to perform in the unit (e.g., + or -)
  - Vj, Vk: value of source operands
      - Store buffers have V field, result to be stored
  - Qj, Qk: reservation stations producing source registers (value to be written)
      - Note: Qj, Qk = 0 means ready
      - Store buffers only have Qi for RS producing result
  - Busy: indicates reservation station or FU is busy
  - In addition, register result status table: indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

Three Stages of Tomasulo Algorithm
  1. Issue: get instruction from FP Op Queue
       If reservation station free (no structural hazard), control issues instr and sends operands (renames registers).
  2. Execute: operate on operands (EX)
       When both operands ready then execute; if not ready, watch Common Data Bus for result.
  3. Write result: finish execution (WB)
       Write on Common Data Bus to all awaiting units; mark reservation station available.

Common Data Bus
  - Normal data bus: data + destination ("go to" bus)
  - Common data bus: data + source ("come from" bus)
      - 64 bits of data + 4 bits of Functional Unit source address
      - Write if matches expected Functional Unit (produces result)
      - Does the broadcast

Tomasulo Example
  Instruction stream:

    Instruction   Dest   j     k
    LD            F6     34    R2
    LD            F2     45    R3
    MULTD         F0     F2    F4
    SUBD          F8     F6    F2
    DIVD          F10    F0    F6
    ADDD          F6     F8    F2

  Hardware: 3 Load Buffers (Load1 to Load3), 3 FP Adder R.S. (Add1 to Add3), 2 FP Mult R.S. (Mult1, Mult2)
  Latencies: LD 1, ADD 2, MULT 10, DIV 40
  Tables tracked each cycle: instruction status (Issue, Exec Comp, Write Result); load buffers (Busy, Address); reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk), where Time is the FU count down; register result status (F0, F2, ..., F30); Clock is the clock cycle counter.

  Clock 0: nothing issued; all load buffers and reservation stations are not busy; register result status is blank.

Tomasulo Example Cycle 1
  - LD F6 issues (Issue = 1)
  - Load1: Busy = Yes, Address = 34 + R2; Load2, Load3: No
  - Register result status: F6 = Load1

Tomasulo Example Cycle 2
  - LD F2 issues (Issue = 2)
  - Load1: Yes, 34 + R2; Load2: Yes, 45 + R3; Load3: No
  - Register result status: F2 = Load2, F6 = Load1

Tomasulo Example Cycle 3
  - MULTD F0 issues (Issue = 3)
  - Load1: Yes, 34 + R2; Load2: Yes, 45 + R3; Load3: No
  - Mult1: Busy = Yes, Op = MULTD, Vk = R(F4), Qj = Load2
  - Register result status: F0 = Mult1, F2 = Load2, F6 = Load1
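
The three stages (issue, execute, write result) can be traced on the example instruction stream with a small simulator. What follows is a minimal sketch, not the slides' own bookkeeping: the names Station, simulate and PROGRAM are hypothetical, and the timing model is an assumption (one issue per cycle, execution starts no earlier than the cycle after issue and after the last operand is broadcast, the result is written the cycle after execution completes, and Common Data Bus contention is ignored). Because of these assumptions, the cycle numbers it reports need not match the cycle tables in the slides.

```python
from dataclasses import dataclass
from typing import Optional

# Latencies and station counts as listed in the example; the timing rules
# in simulate() are simplifying assumptions, not taken from the slides.
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}
UNIT = {"LD": "Load", "ADDD": "Add", "SUBD": "Add", "MULTD": "Mult", "DIVD": "Mult"}
STATIONS = ["Load1", "Load2", "Load3", "Add1", "Add2", "Add3", "Mult1", "Mult2"]

# (op, dest, j, k); for LD, j and k are the address parts and are always ready.
PROGRAM = [
    ("LD", "F6", "34", "R2"),
    ("LD", "F2", "45", "R3"),
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]

@dataclass
class Station:
    name: str
    busy: bool = False
    qj: Optional[str] = None  # tag of the station producing operand j; None = ready
    qk: Optional[str] = None  # same for operand k
    remaining: int = -1       # execution cycles left; -1 = not started
    instr: int = -1           # index of the instruction held

def simulate(program):
    rs = [Station(n) for n in STATIONS]
    reg_status = {}  # register result status: register -> tag that will write it
    times = [{"issue": None, "exec_done": None, "write": None} for _ in program]
    pc = cycle = 0
    while any(t["write"] is None for t in times):
        cycle += 1
        # 1. Issue in order if a station of the right kind is free.
        if pc < len(program):
            op, dest, j, k = program[pc]
            free = next((s for s in rs
                         if not s.busy and s.name.startswith(UNIT[op])), None)
            if free is not None:
                free.busy, free.instr, free.remaining = True, pc, -1
                # Renaming: a source becomes a tag if a station will produce it.
                free.qj, free.qk = reg_status.get(j), reg_status.get(k)
                reg_status[dest] = free.name
                times[pc]["issue"] = cycle
                pc += 1
        # 2. Execute stations whose operands are both ready.
        for s in rs:
            if not s.busy:
                continue
            t = times[s.instr]
            if t["exec_done"] is not None or t["issue"] == cycle:
                continue
            if s.qj is None and s.qk is None:
                if s.remaining < 0:
                    s.remaining = LATENCY[program[s.instr][0]]
                s.remaining -= 1
                if s.remaining == 0:
                    t["exec_done"] = cycle
        # 3. Write result: broadcast the tag, wake waiters, free the station.
        for s in rs:
            if s.busy and times[s.instr]["exec_done"] not in (None, cycle):
                times[s.instr]["write"] = cycle
                for w in rs:
                    if w.qj == s.name:
                        w.qj = None
                    if w.qk == s.name:
                        w.qk = None
                dest = program[s.instr][1]
                if reg_status.get(dest) == s.name:
                    del reg_status[dest]
                s.busy = False
    return times

if __name__ == "__main__":
    for ins, t in zip(PROGRAM, simulate(PROGRAM)):
        print(ins, t)
```

Under these assumptions the instructions issue in order, but ADDD writes F6 long before DIVD writes F10, even though DIVD reads the old F6. DIVD picked up its F6 operand at issue time, so the later write cannot disturb it, which is the WAR hazard that renaming removes.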

