CS6810 School of Computing University of Utah

1 Pipelines
Today’s topics:
• Evidence suggests there is some rust on this topic
  » hence spend a week and move on
  » also need some common terminology
• Attempt to present the ideal issues
  » with some discussion of why ideal isn’t reality

2 Pipelining
• Computational assembly line
  – each step does a small fraction, 1/pipeline_depth, of the job
  – concurrent execution of pipeline_depth instructions
    » performance is all about parallelism
• Vertical vs. horizontal concurrency
• Pipeline stage – 1 step in an N-step pipe
  – 1 cycle per stage
    » synchronous design – slowest stage sets the clock rate
    » laminar (equal stage delays) is the target
• Simple model

3 Pipeline Benefit = Performance
• Ideal performance
  – time-per-instruction = unpipelined instruction time / #stages
    » asymptotic – overheads count
  – within ~10% of ideal is typically achieved
• 2 ways to view this performance enhancement
  – logical
    » work on several instructions at once
      • albeit in different stages of their execution
    » parallelism
      • average CPI reduced
  – physical
    » shorter stages = increased frequency

4 Other Pipeline Benefits
• HW mechanism
  – hidden from the SW, so invisible to the user
  – just viewed as a benefit
• No programming impact
  – unless the user needs the ultimate in performance
  – usually left up to compiler scheduling & optimization
• Pipelines are everywhere
  – key to keeping on the Moore’s-law curve in the ’80s
  – ’90s just moved to multiple pipelines
  – frequency wars
    » pushed pipeline depth to the lunatic fringe
      • problems
        – power ∝ frequency
        – overheads make ideal performance a bit optimistic

5 Consider MIPS64
• 5 steps in instruction execution
  – fetch, decode, execute, mem, write-back
• Remember the ISA

6 Stages vary by Instruction
• Stage 3 (EX)
  – reg-reg operation, or effective-address calculation, or branch target
    » for any instruction – only one role
• Stage 4 (MEM)
  – only active on load/store/jump/branch
    » LMD ← Mem[ALUoutput]
    » Mem[ALUoutput] ← SMD
    » next PC = ALUoutput w/ condition
      • jump – no condition
• Stage 5 (WB)
  – reg-reg
    » Regs[IR16..20] ← ALUoutput
  – reg-immediate
    » Regs[IR11..15] ← ALUoutput
  – load
    » Regs[IR11..15] ← memory data return

7 Example 5-stage Data-path
(figure: 5-stage datapath diagram)

8 Inter-Stage Registers
• Pre-IF
  – next PC
• IF:ID
  – PC+4
  – IR: opcode, RS1, RS2, RD, imm16, function
  – Wbmux value
• ID:EX
  – PC+4
  – IR1: Amux_sel, Bmux_sel, ALUop, Wbmux_sel, R/Wmem, Mmux_sel
  – immediate data: 16 or 26 bits
• EX:Mem
  – ALUout, SMD, mux selector indices, R vs. W command
• M:WB
  – ALUout, LMD

9 How real was that?
• Depends
  – real for simple architectures
    » woefully oversimplified for higher-performance architectures
  – not optimized
    » 2 ALUs
      • IF and EX – but ALUs are cheap, so who cares?
    » Harvard architecture
      • separate instruction and data memories
        – typical at L1, but unified below that
    » 5x frequency for five stages
      • slowed down by inter-stage register overhead
• Data-path is only part of the architecture
  – largest bit in terms of area
  – easiest bit in terms of getting it right
  – control path
    » FSM or microcode or both?

10 Control vs. Data Example
• Look at a few typical components

11 Control Path
• Each component has control points
  – register: load or output enable
  – mux/demux: select lines
  – memory: R vs. W
  – XU: opcode
• What vs. when
  – when: controlled by a clock
    » SDR vs. DDR
  – what: controlled by FSM or uCode
    » control-point values
• Note: the book ignores this for the most part
  » fine in a way
    • tends to consume a small amount of area and power
    • BUT tends to be the major problem – in terms of getting it right!!
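The control-path idea above — an FSM stepping through states and asserting control-point values each cycle — can be sketched as a table-driven loop. This is a minimal illustration, not the deck’s actual FSM: the state names and signal names (pc_enable, alu_op, and so on) are hypothetical, chosen only to mirror the five-stage datapath.

```python
# Minimal sketch of a control FSM for a register-register add.
# Each state corresponds to one pipeline stage; the FSM "output" is
# the set of control-point values asserted during that cycle.
# All signal names below are illustrative, not from the slides.

CONTROL_ROM = {
    "IF":  {"pc_enable": 1, "ir_load": 1},
    "ID":  {"reg_read": 1, "imm_extend": 0},
    "EX":  {"alu_op": "ADD", "amux_sel": "REG", "bmux_sel": "REG"},
    "MEM": {"mem_read": 0, "mem_write": 0},   # idle for reg-reg ops
    "WB":  {"reg_write": 1, "wbmux_sel": "ALU"},
}

STATE_ORDER = ["IF", "ID", "EX", "MEM", "WB"]

def run_add():
    """Step the FSM through one instruction, collecting (state, signals)."""
    return [(state, CONTROL_ROM[state]) for state in STATE_ORDER]

for state, signals in run_add():
    print(state, signals)
```

In a real design this table lives in an FSM or microcode ROM; the point is that the data-path is dumb, and the per-cycle control-point values carry all the sequencing.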
12 Example: FSM for a simple Add
(figure: FSM diagram)

13 Full Control Scenario
(figure)

14 Pipeline Parallelism
• Best case – execute 5 instructions at once
  – note pipeline fill and flush overhead
  – in steady state
    » 5x frequency is the ideal speedup
• Problem
  – consider a single I & D memory
    » IF (step 1) and MEM (step 4) then have a resource conflict

15 Pipeline Characteristics
• Latency
  – time it takes for an instruction to complete
    » worse w/ a pipeline, since latch delay is added to the critical path
    » dominant feature if there are lots of exceptions
      • steady state doesn’t last for long
      • branch mispredicts, cache misses, real exceptions
• Throughput
  – dominant feature if steady state is common
    » compiler tries hard to make this true, e.g. no
      » cache misses
      » register misses
      » speculation failures
      » real exceptions

16 Example
• Unpipelined
  – 5 steps: 50, 50, 60, 50, 50 ns respectively
  – total 260 ns
• Turn it into a pipelined design
  – 10 ns of “laminarity” penalty (60 ns slowest stage vs. the 50 ns others)
  – 5 ns delay due to latches
    » set-up, hold, and fall-through delays
• Hence must run at the slowest stage’s rate
  – clock = 65 ns
  – speedup = 260/65 = 4x
    » rather than the idealized 5x

17 Pipeline Hair
• Laminarity is hard
  – depends a lot on the FO4 budget
    » 20+ FO4 is somewhat easy
    » 13 or fewer has proven problematic
• Extra resources
  – each stage needs its own
    » design drill
      • list all possible instruction resource needs
      • separate by stage
      • each stage needs its private set
• Example
  – PC modification can’t use the same ALU as arithmetic
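The speedup arithmetic on the Example slide (50/50/60/50/50 ns stages, 5 ns latch overhead) can be checked in a few lines of Python:

```python
# Pipeline speedup for the slide's example.
# Unpipelined: one instruction passes through all five steps serially.
stage_delays_ns = [50, 50, 60, 50, 50]
latch_overhead_ns = 5  # set-up, hold, and fall-through delays

unpipelined_time = sum(stage_delays_ns)                   # 260 ns per instruction
clock_period = max(stage_delays_ns) + latch_overhead_ns   # slowest stage sets the clock: 65 ns

speedup = unpipelined_time / clock_period
print(f"clock = {clock_period} ns, speedup = {speedup:.1f}x")  # 4.0x, vs. the ideal 5x
```

The gap between 4x and the ideal 5x is exactly the laminarity penalty (60 ns vs. 50 ns) plus the latch overhead.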
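Tying the slides together: the fill/flush overhead from the Pipeline Parallelism slide can be quantified with a simple count — an n-stage pipe takes n cycles to fill, then retires one instruction per cycle. This model ignores all hazards (the "steady state" assumption the deck warns about).

```python
# Cycle count for k instructions through an n-stage pipe:
# n cycles to fill, then one additional cycle per extra instruction.
def pipeline_cycles(k_instructions, n_stages=5):
    return n_stages + (k_instructions - 1)

# Speedup over unpipelined execution (k * n stage-times) approaches n
# as k grows -- the "5x ideal" from the slides, reached only asymptotically.
def speedup(k_instructions, n_stages=5):
    return (k_instructions * n_stages) / pipeline_cycles(k_instructions, n_stages)

print(pipeline_cycles(5))       # 9 cycles for 5 instructions (vs. 25 unpipelined)
print(round(speedup(1000), 2))  # close to, but below, the ideal 5x
```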