Pipelining

- Computational assembly line: each step does a small fraction (1/pipeline-depth) of the job, giving concurrent execution of pipeline-depth instructions
- Performance is all about parallelism: vertical vs. horizontal concurrency
- Pipeline stage: 1 step in an N-step pipe
  - 1 cycle per stage: synchronous design, the slowest stage sets the clock rate; laminar is the target
- Today's topics
  - evidence suggests there is some rust on this topic, hence spend a week and move on
  - also need some common terminology
  - simple model: attempt to present the ideal, with some discussion of why the ideal isn't reality

School of Computing, University of Utah (CS6810)

Pipeline Benefit: Performance

- Ideal performance: time per instruction = unpipelined instruction time / stages
  - asymptotic: overheads count, so the ideal is typically not fully achieved
- 2 ways to view this performance enhancement
  - logical: work on several instructions at once, albeit in different stages of their execution
  - physical: shorter stages, increased frequency

Other Pipeline Benefits

- HW mechanism hidden from the SW, so invisible to the user: just viewed as a benefit
- No programming impact
  - unless the user needs the ultimate in performance; usually left up to compiler scheduling and optimization
- Pipelines are everywhere: key to keeping on the Moore's-law curve
  - in the 80's and 90's: frequency wars pushed pipeline depth to the lunatic fringe; problems: power and frequency overheads make ideal performance a bit optimistic
  - since then, just moved to multiple pipelines: parallelism, with average IPC reduced

Consider MIPS64

- 5 steps in instruction execution: fetch, decode, execute, mem, write back
- Remember the ISA

Stages vary by Instruction

- Stage 3 (EX): execute a reg-reg op, or calculate an effective address, or a branch target; only one role for any instruction
- Stage 4 (MEM): only active on Load/Store/Jump/Branch
  - Load: LMD <- Mem[ALUoutput]
  - Store: Mem[ALUoutput] <- SMD
  - Branch: next PC <- ALUoutput (with condition); Jump: no condition
- Stage 5 (WB)
  - Reg-Reg: Regs[IR16..20] <- ALUoutput
  - Reg-Immediate: Regs[IR11..15] <- ALUoutput
  - Load: Regs[IR11..15] <- memory data return

Example: 5-stage Data path (figure)

Inter-Stage Registers

- Pre-IF: next PC
- IF/ID: PC+4, IR (opcode, RS1, RS2, RD, imm16, function)
- ID/EX: PC+4, IR, Amux sel, Bmux sel, ALUop, Wbmux sel, R/W mem, Mmux sel, immediate data (16 or 26 bits)
- EX/Mem: ALUout, SMD, mux selector indices, R vs. W command
- Mem/WB: ALUout, LMD, Wbmux value

How real was that?

- Depends
  - real for simple architectures; woefully over-simplified for higher-performance architectures
- Not optimized
  - 2 ALUs (IF and EX), but ALUs are cheap so who cares
  - Harvard architecture: separate instruction and data memories; typical at L1 but unified below that
  - 5x frequency for five stages, slowed down by inter-stage register overhead

Control vs. Data

- The data path is only part of the architecture
  - largest bit in terms of area, easiest bit in terms of getting it right
- Control path: FSM or microcode or both

Control Path

- Each component has control points
  - register: load or output enable
  - mux/demux: select lines
  - memory: R vs. W
  - XU: opcode
- What vs. When
  - "when" is controlled by a clock (SDR vs. DDR)
  - "what" is controlled by FSM or uCode control-point values
- Note: the book ignores this for the most part
  - fine in a way: control tends to consume a small amount of area and power
  - BUT it tends to be the major problem in terms of getting it right

Example: FSM for a simple Add (figure)

Full Control Scenario (figure)

Pipeline Parallelism

- Best case: execute 5 instructions at once
- Note pipeline fill and flush overhead; in steady state, 5x frequency is the ideal speedup
- Problem: consider a single I & D memory; steps 4 and 5 have a resource conflict
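The fill/flush overhead above can be made concrete with a minimal timing model. This is a sketch under simplifying assumptions (one instruction issued per cycle, no hazards or stalls); the function names are made up for illustration:

```python
# Minimal model of ideal pipeline timing: k stages, one instruction
# issued per cycle, no hazards or stalls (simplifying assumptions).

def pipelined_cycles(n_instructions: int, depth: int) -> int:
    """Cycles to run n instructions through a depth-stage pipe:
    'depth' cycles to fill, then one instruction completes per cycle."""
    return depth + (n_instructions - 1)

def speedup(n_instructions: int, depth: int) -> float:
    """Unpipelined execution takes 'depth' cycles per instruction."""
    unpipelined = n_instructions * depth
    return unpipelined / pipelined_cycles(n_instructions, depth)

# Fill/flush overhead means the 5x ideal is only reached asymptotically:
print(speedup(5, 5))       # short run: well under 5x (25/9, about 2.78)
print(speedup(10_000, 5))  # long run: approaches 5x
```

This is why the ideal N-x speedup is asymptotic: the fill and flush cycles are amortized only over long uninterrupted runs, which is exactly what branch mispredicts and exceptions cut short.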
Pipeline Characteristics

- Latency: the time it takes for an instruction to complete
  - worse with a pipeline, since latch delay is added to the critical path
  - the dominant feature if there are lots of exceptions: steady state doesn't last long (branch mispredicts, cache misses, real exceptions)
- Throughput
  - the dominant feature if steady state is common; hence the compiler tries hard to make this true (e.g. no cache misses, register misses, speculation failures, real exceptions)

Example: Latency

- Unpipelined: 5 steps of 50, 50, 60, 50, 50 ns respectively; total 260 ns
- Turn it into a pipelined design
  - 10 ns laminarity penalty (stages are unequal), plus 5 ns delay due to latches (set-up, hold, and fall-through delays)
  - must run at the slowest-stage rate: clock = 65 ns
  - speedup = 260/65 = 4x, rather than the idealized 5x

Pipeline Hair

- Laminarity is hard
  - depends a lot on the FO4 budget: 20 FO4 is somewhat easy, 13 has proven problematic
- Extra resources: each stage needs its own
  - design drill: list all possible instruction resource needs, separated by stage; each stage needs its private set
  - example: PC modification can't use the same ALU as arithmetic ops; IF and Mem can't access the same memory

Pipeline Memory Issues

- More instructions in flight means increased memory pressure
  - bandwidth requirements: Nx for an N-stage pipeline
- Key issue with memory: it's slow
  - bigger memories are slower and consume more power; tiling improves latency but not power
- Fixes
  - Harvard architecture: instruction and data streams have independent roles and different access patterns, which is an optimization opportunity
  - multi-level cache memory hierarchy
  - speculative prefetch
  - pipeline the memory system: works for both cache and main memory

Hazards & Dependencies

- Consider a pair of instructions:
  R5 <- R2 + R3
  R3 <- R5 + R6
  - write back of R5 happens in stage 5, but the R5 value is needed by stage 3: OOPS
- The value is actually known at the end of stage 3 and used on the next cycle in stage 3
  - send a bypass value to stage 4 and to the beginning of stage 3
  - costs more logic and more control, and adds mux delay
- Enter bypass and stalls

3 Types of Hazards

- Structural: resource contention between different pipeline stages
  - e.g. register read in ID vs. register write in WB with a 2-ported register file; a typical arithmetic op is 2 reads and one write, hence 3 ports (2R and 1W); superscalar makes this worse
- Data: a dependency on either a register source or destination
- Control: PC incremented or computed; branch and jump effects
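The worked numbers and the hazard pair above can be checked with a short Python sketch. This is a simplified model, not the actual datapath; the constant and function names are made up for illustration:

```python
# Worked numbers from the latency example: stage delays in ns.
STAGE_NS = [50, 50, 60, 50, 50]   # IF, ID, EX, MEM, WB
LATCH_NS = 5                      # set-up/hold/fall-through delay per latch

unpipelined = sum(STAGE_NS)            # 260 ns per instruction
clock = max(STAGE_NS) + LATCH_NS       # slowest stage sets the rate: 65 ns
print(unpipelined / clock)             # 4.0, not the idealized 5x

# RAW hazard check for the pair  R5 <- R2 + R3 ; R3 <- R5 + R6:
def raw_hazard(dest: str, next_sources: list[str]) -> bool:
    """True if the second instruction reads the register the first writes."""
    return dest in next_sources

print(raw_hazard("R5", ["R5", "R6"]))  # True: needs a bypass or a stall
```

The 10 ns laminarity penalty shows up as the gap between the 60 ns EX stage and the 50 ns others; the 5 ns latch overhead is then added on top, which is why the clock is 65 ns rather than 52 ns (260/5).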


U of U CS 6810 - Pipelines
