CS6810 School of Computing University of Utah

1 Pipelines
Today’s topics:
• Evidence suggests there is some rust on this topic
  » hence spend a week and move on
  » also need some common terminology
• Attempt to present the ideal issues
  » with some discussion of why ideal isn’t reality

2 Pipelining
• Computational assembly line
  – each step does a small fraction, 1/pipeline_depth, of the job
  – concurrent execution of pipeline_depth instructions
    » performance is all about parallelism
• Vertical vs. horizontal concurrency
• Pipeline stage – 1 step in an N-step pipe
  – 1 cycle per stage
    » synchronous design – slowest stage sets the clock rate
    » laminar (equal stage delays) is the target
• Simple model

3 Pipeline Benefit = Performance
• Ideal performance
  – time-per-instruction = unpipelined instruction time / #stages
    » asymptotic – overheads count
  – within ~10% of ideal is typically achieved
• 2 ways to view this performance enhancement
  – logical
    » work on several instructions at once
      • albeit in different stages of their execution
    » parallelism
      • average CPI reduced
  – physical
    » shorter stages = increased frequency

4 Other Pipeline Benefits
• HW mechanism
  – hidden from the SW, so invisible to the user
  – just viewed as a benefit
• No programming impact
  – unless the user needs the ultimate in performance
  – usually left up to compiler scheduling & optimization
• Pipelines are everywhere
  – key to keeping on the Moore’s-law curve in the ’80s
  – ’90s just moved to multiple pipelines
  – frequency wars
    » pushed pipeline depth to the lunatic fringe
      • problems
        – power ∝ frequency
        – overheads make ideal performance a bit optimistic

5 Consider MIPS64
• 5 steps in instruction execution
  – fetch, decode, execute, mem, write-back
• Remember the ISA

6 Stages vary by Instruction
• Stage 3 (EX)
  – reg-reg operation, or effective-address calculation, or branch target
    » for any instruction – only one role
• Stage 4 (MEM)
  – only active on load/store/jump/branch
    » LMD ← Mem[ALUoutput]
    » Mem[ALUoutput] ← SMD
    » next PC = ALUoutput w/ condition
      • jump – no condition
• Stage 5 (WB)
  – reg-reg
    » Regs[IR16..20] ← ALUoutput
  – reg-immediate
    » Regs[IR11..15] ← ALUoutput
  – load
    » Regs[IR11..15] ← memory data return

7 Example 5-stage Data-path
(figure: 5-stage datapath diagram)

8 Inter-Stage Registers
• Pre-IF
  – next PC
• IF:ID
  – PC+4
  – IR: opcode, RS1, RS2, RD, imm16, function
  – Wbmux value
• ID:EX
  – PC+4
  – IR1: Amux_sel, Bmux_sel, ALUop, Wbmux_sel, R/Wmem, Mmux_sel
  – immediate data: 16 or 26 bits
• EX:Mem
  – ALUout, SMD, mux selector indices, R vs. W command
• M:WB
  – ALUout, LMD

9 How real was that?
• Depends
  – real for simple architectures
    » woefully oversimplified for higher-performance architectures
  – not optimized
    » 2 ALUs
      • IF and EX – but ALUs are cheap, so who cares?
    » Harvard architecture
      • separate instruction and data memories
        – typical at L1, but unified below that
    » 5x frequency for five stages
      • slowed down by inter-stage register overhead
• Data-path is only part of the architecture
  – largest bit in terms of area
  – easiest bit in terms of getting it right
  – control path
    » FSM or microcode or both?

10 Control vs. Data Example
• Look at a few typical components

11 Control Path
• Each component has control points
  – register: load or output enable
  – mux/demux: select lines
  – memory: R vs. W
  – XU: opcode
• What vs. when
  – when: controlled by a clock
    » SDR vs. DDR
  – what: controlled by FSM or uCode
    » control-point values
• Note: the book ignores this for the most part
  » fine in a way
    • tends to consume a small amount of area and power
    • BUT tends to be the major problem – in terms of getting it right!!
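The control-path idea above — an FSM stepping through states and asserting control-point values each cycle — can be sketched as a table-driven loop. This is a minimal illustration, not the deck’s actual FSM: the state names and signal names (pc_enable, alu_op, and so on) are hypothetical, chosen only to mirror the five-stage datapath.

```python
# Minimal sketch of a control FSM for a register-register add.
# Each state corresponds to one pipeline stage; the FSM "output" is
# the set of control-point values asserted during that cycle.
# All signal names below are illustrative, not from the slides.

CONTROL_ROM = {
    "IF":  {"pc_enable": 1, "ir_load": 1},
    "ID":  {"reg_read": 1, "imm_extend": 0},
    "EX":  {"alu_op": "ADD", "amux_sel": "REG", "bmux_sel": "REG"},
    "MEM": {"mem_read": 0, "mem_write": 0},   # idle for reg-reg ops
    "WB":  {"reg_write": 1, "wbmux_sel": "ALU"},
}

STATE_ORDER = ["IF", "ID", "EX", "MEM", "WB"]

def run_add():
    """Step the FSM through one instruction, collecting (state, signals)."""
    return [(state, CONTROL_ROM[state]) for state in STATE_ORDER]

for state, signals in run_add():
    print(state, signals)
```

In a real design this table lives in an FSM or microcode ROM; the point is that the data-path is dumb, and the per-cycle control-point values carry all the sequencing.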
12 Example: FSM for a simple Add
(figure: FSM diagram)

13 Full Control Scenario
(figure)

14 Pipeline Parallelism
• Best case – execute 5 instructions at once
  – note pipeline fill and flush overhead
  – in steady state
    » 5x frequency is the ideal speedup
• Problem
  – consider a single I & D memory
    » IF (step 1) and MEM (step 4) then have a resource conflict

15 Pipeline Characteristics
• Latency
  – time it takes for an instruction to complete
    » worse w/ a pipeline, since latch delay is added to the critical path
    » dominant feature if there are lots of exceptions
      • steady state doesn’t last for long
      • branch mispredicts, cache misses, real exceptions
• Throughput
  – dominant feature if steady state is common
    » compiler tries hard to make this true, e.g. no
      » cache misses
      » register misses
      » speculation failures
      » real exceptions

16 Example
• Unpipelined
  – 5 steps: 50, 50, 60, 50, 50 ns respectively
  – total 260 ns
• Turn it into a pipelined design
  – 10 ns of “laminarity” penalty (60 ns slowest stage vs. the 50 ns others)
  – 5 ns delay due to latches
    » set-up, hold, and fall-through delays
• Hence must run at the slowest stage’s rate
  – clock = 65 ns
  – speedup = 260/65 = 4x
    » rather than the idealized 5x

17 Pipeline Hair
• Laminarity is hard
  – depends a lot on the FO4 budget
    » 20+ FO4 is somewhat easy
    » 13 or fewer has proven problematic
• Extra resources
  – each stage needs its own
    » design drill
      • list all possible instruction resource needs
      • separate by stage
      • each stage needs its private set
• Example
  – PC modification can’t use the same ALU as arithmetic
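The speedup arithmetic on the Example slide (50/50/60/50/50 ns stages, 5 ns latch overhead) can be checked in a few lines of Python:

```python
# Pipeline speedup for the slide's example.
# Unpipelined: one instruction passes through all five steps serially.
stage_delays_ns = [50, 50, 60, 50, 50]
latch_overhead_ns = 5  # set-up, hold, and fall-through delays

unpipelined_time = sum(stage_delays_ns)                   # 260 ns per instruction
clock_period = max(stage_delays_ns) + latch_overhead_ns   # slowest stage sets the clock: 65 ns

speedup = unpipelined_time / clock_period
print(f"clock = {clock_period} ns, speedup = {speedup:.1f}x")  # 4.0x, vs. the ideal 5x
```

The gap between 4x and the ideal 5x is exactly the laminarity penalty (60 ns vs. 50 ns) plus the latch overhead.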
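Tying the slides together: the fill/flush overhead from the Pipeline Parallelism slide can be quantified with a simple count — an n-stage pipe takes n cycles to fill, then retires one instruction per cycle. This model ignores all hazards (the "steady state" assumption the deck warns about).

```python
# Cycle count for k instructions through an n-stage pipe:
# n cycles to fill, then one additional cycle per extra instruction.
def pipeline_cycles(k_instructions, n_stages=5):
    return n_stages + (k_instructions - 1)

# Speedup over unpipelined execution (k * n stage-times) approaches n
# as k grows -- the "5x ideal" from the slides, reached only asymptotically.
def speedup(k_instructions, n_stages=5):
    return (k_instructions * n_stages) / pipeline_cycles(k_instructions, n_stages)

print(pipeline_cycles(5))       # 9 cycles for 5 instructions (vs. 25 unpipelined)
print(round(speedup(1000), 2))  # close to, but below, the ideal 5x
```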