CS6810 School of Computing University of Utah

Dynamic Issue & HW Speculation

Today’s topics:
    Superscalar pipelines
    Dynamic Issue
    Scoreboarding: control centric approach
    Tomasulo: data centric approach

Raising the IPC Ceiling
• w/ single-issue
    IPCmax = 1
    schedule as hard as you want and it’s still the asymptote
    » keeping things in order: lots of stalls
        • XUs (execution units) finish out of order anyway
    » when the transistor budget is high enough
        • just go with multiple issue
            – >= 4 issue common today ::= superscalar machines
• Superscalar issues: issue width = n
    need n-way capability in all pipeline stages
    » fetch n
        – no worries: fetch a cache line of instructions/cycle
    » decode n
        • get register values – problems?
    » execute n
        • problems?
    » mem n
        • problems? w/ out of order completion?
    » WB n
        • problems w/ out of order completion?

Fix OOO Completion Problem First
• Enter the ROB (re-order buffer)
    basic idea for now
    » issue instructions in order
    » retire/commit instructions in order
    » use an intermediate buffer to hold results
        • since destructive action to register file or memory must happen in order
• Other ROB niceties
    helps w/
    » speculation
    » nullification
    » exceptions
    but first a simple example

Reorder Buffer In Action
    See any problems?

Several Issues
• WB stage is now the commit stage
    ROB values move to the register file
    » whoops if tags are in the issue queue
        • those values need to be renamed to the register name
        • seems complex – can you think of a better way?
    IQ (issue queue) contains both register and tag fields
    » w/ 1 bit to select which is valid
        • initially tag is selected
        • when tag is retired
            – broadcast to IQ and invert selector on a match
        • what about tag values in the pipe
            – only need to worry about entry into EX stage
            – compares needed there as well
            – ROB is WB stage so that’s not a problem
            – MEM isn’t a problem either. WHY?
• Key observation
    all destructive operations are done by the ROB commit/retire

Nullification & Exceptions
• If an exception happens
    exception type is written to the ROB field
    » note that one instruction could generate an exception in multiple stages
        • only care about the first one so no overwrite is allowed
• If some instruction is speculative
    then predicate is written to the ROB field
    note: predicate covers branch delay slots and effectively supports nullification
• WB stage
    in reality tries to retire n instructions per cycle
    » if none have pending predicates or exceptions then retire
    » in-order retire: at the 1st member of the n-instruction bundle w/ a problem
        • retire the instructions before
        • nullify whatever is next in the bundle
        • take the exception and hold the rest

Decode Complexity
• ROB complicates ID significantly
    operand fetch now has two sources
    » register file or ROB field
        • hence an additional mux is required
    rename takes some time
    » structural issue requirements will help mitigate the performance penalty
• Bottom line
    ID will no longer be a single cycle stage
• For register poor ISAs like x86
    ROB slots effectively provide a renamed register pool
    » actually it’s not the right choice
        • Why?
        • remember the front-end back-end x86 thing

ROB Hazard Removal
• RAW
    nothing changes here
    » no way you can use a value before it’s computed
    » unless the value is predicted and predicated
        • only some academic papers think this is a reasonable idea
    » hence instruction scheduling is required
• WAx (WAR and WAW)
    ROB renaming effectively removes this problem
    » as long as enough ROB slots exist
    » if not
        • then the instruction can’t be issued and a NOP is injected in the pipe
• Note
    stalling pipelines @ GHz frequencies is a problem
    » hence NOPs are dynamically generated and pushed through the pipe
    » any issues here?

EX Stage XUs
• Typical separation of XUs
    ALU (int +/-, shift, logical (AND, OR, XOR, NOT))
    int-multiply
    int-divide
    FP ops can be 32 or 64-bit (typically implement 64-bit)
    » FP-add-sub
    » FP-multiply
    » FP-divide or FP-invert (1/x)
    » FP-sqrt or FP-isqrt?
• Overlaps
    Branch and Mem ops can be handled with an ALU
    int mul or div can be handled by the FP equivalent
    » a common choice is to have an int-mul but not an int-div
        • why?
    actual choice influences structural issue rules

Structural Issue Rules
• Clearly vary by machine
• Example for a 6 issue machine
    2 ALU
    1 Branch
    1 Int Mul or Divide
    1 FP Add or Sub
    1 Mem
• Why does this make sense? e.g. justification

Structural Issue Rules
• Clearly vary by machine
• Example for a 6 issue machine
    2 ALU or 1 ALU and 1 Int-Mul
    1 Branch
    1 FP Mul or Divide
    1 FP Add or Sub
    1 Mem
• Why does this make sense?
    Look at instruction frequency and common effort
    » Branch average about every 6 instructions so need that
    » LD + ST about every 6 as well
    » seldom need FP Mul & Divide on same cycle
    » FP Add/Sub share exponent normalization
    » Int-Divide is done on the FP-Div unit

Dynamic Issue
• Until Now
    instructions have been issued in order
    » compiler thinks the world is sequential
    » HW must
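
The reorder buffer behaviour described in the earlier slides (in-order issue, out-of-order completion, in-order retire of up to n instructions per cycle, first exception wins) can be sketched as a small simulation. This is a minimal illustration, not the design from the lecture: the names `RobEntry`, `ReorderBuffer`, `issue`, `complete`, and `retire` are hypothetical, predicates and speculation are not modelled, and on an exception the sketch simply flushes every younger entry rather than holding the ones beyond the current bundle.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    tag: int                         # ROB slot id, doubles as the rename tag
    dest: str                        # architectural destination register
    value: Optional[int] = None      # result, filled in when execution finishes
    done: bool = False               # True once an XU has written its result
    exception: Optional[str] = None  # first exception raised, if any


class ReorderBuffer:
    """Hypothetical ROB model: allocate in order, complete in any order,
    commit to the register file strictly in order."""

    def __init__(self, size: int, retire_width: int):
        self.size = size
        self.retire_width = retire_width
        self.entries = deque()   # oldest instruction at the left
        self.regfile = {}        # committed architectural state
        self.next_tag = 0

    def issue(self, dest: str) -> Optional[int]:
        """Allocate a slot in program order; None means no slot is free."""
        if len(self.entries) >= self.size:
            return None
        entry = RobEntry(tag=self.next_tag, dest=dest)
        self.next_tag += 1
        self.entries.append(entry)
        return entry.tag

    def complete(self, tag: int, value: int, exception: Optional[str] = None):
        """Record a result; execution units may call this in any order."""
        for e in self.entries:
            if e.tag == tag:
                e.value, e.done = value, True
                if e.exception is None:   # no overwrite: keep the first exception
                    e.exception = exception
                return
        raise KeyError(tag)

    def retire(self):
        """Commit up to retire_width finished entries from the head, in order.
        Returns (list of retired tags, exception taken or None)."""
        retired = []
        while self.entries and len(retired) < self.retire_width:
            head = self.entries[0]
            if not head.done:
                break                     # an older instruction is still executing
            if head.exception is not None:
                # Simplification: nullify the faulting and all younger entries.
                self.entries.clear()
                return retired, head.exception
            self.regfile[head.dest] = head.value   # the only destructive update
            retired.append(self.entries.popleft().tag)
        return retired, None


rob = ReorderBuffer(size=4, retire_width=2)
t0 = rob.issue("r1")
t1 = rob.issue("r2")
rob.complete(t1, 20)      # the younger instruction finishes first
print(rob.retire())       # ([], None): head is not done, so nothing commits
rob.complete(t0, 10)
print(rob.retire())       # ([0, 1], None): both commit, oldest first
```

Keeping the register-file write inside `retire` mirrors the key observation from the slides: results may arrive in any order, but every destructive update happens at commit.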