CS250 VLSI Systems Design
Lecture 5: Physical Realities Beneath the Digital Abstraction, Part 1: Timing
Fall 2011
Krste Asanovic, John Wawrzynek, with John Lazzaro and Brian Zimmer
Lecture 05, Timing. CS250, UC Berkeley, Fall '11

What do Computer Architects need to know about physics?
Physics and its effect:
- Area: cost
- Delay: performance
- Energy: performance and cost
Ideally, zero delay, area, and energy. However, the physical devices occupy area, take time, and consume energy.
The CMOS process lets us build transistors, wires, and connections, and we get capacitors, inductors, and resistors whether or not we want them.

Physical Layout
- The switch-level abstraction gives a good way to understand the function of a circuit.
  - nFET: g = 1 gives a short circuit, otherwise open.
  - pFET: g = 0 gives a short circuit, otherwise open.
- Understanding delay means going below the switch-level abstraction to transistor physics and layout details.

Gate Delay
- Modern CMOS gate delays are on the order of a few picoseconds. (However, they are highly dependent on gate context.)
- Often expressed as FO4 delays (fan-out of 4) as a process-independent delay metric: the delay of an inverter driven by an inverter 4x smaller than itself and driving an inverter 4x larger than itself.
- For our 90nm process, FO4 is around 20ps.

Path Delay
- For correct operation, on all paths:
  Total Delay <= clock_period - FF_setup_time - FF_clk_to_q - clock_skew
- In high-speed processors, critical paths have around 10-20 FO4 delays.

FO4 Delays per clock period
[Figure: FO4 delays per CPU clock period, 1985-2005. Labeled points: MIPS 2000 (5 stages), Pentium Pro (10 stages), Pentium 4 (20 stages). Annotation: historical limit about 12. Thanks to Francois Labonte, Stanford.]
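As a rough illustration of the path-delay constraint above, the sketch below turns a clock period into a budget for combinational logic and expresses that budget in FO4 units. Only the 20 ps FO4 figure comes from the lecture; the clock period, setup time, clk-to-q delay, and skew are made-up values, and the function names are hypothetical.

```python
# Minimal sketch of the path-delay budget. All numeric inputs in the
# example are illustrative assumptions, except FO4_PS, which is the
# "around 20ps" figure quoted for the 90nm process.

FO4_PS = 20.0  # FO4 delay in picoseconds


def logic_budget_ps(clock_period_ps, ff_setup_ps, ff_clk_to_q_ps, clock_skew_ps):
    """Largest total delay a register-to-register path may have."""
    return clock_period_ps - ff_setup_ps - ff_clk_to_q_ps - clock_skew_ps


def to_fo4(delay_ps, fo4_ps=FO4_PS):
    """Express a delay in units of FO4 inverter delays."""
    return delay_ps / fo4_ps


def meets_timing(path_delay_ps, budget_ps):
    """True if the path satisfies Total Delay <= budget."""
    return path_delay_ps <= budget_ps


if __name__ == "__main__":
    # Assumed example: 500 ps clock (2 GHz), 30 ps setup, 40 ps clk-to-q, 30 ps skew.
    budget = logic_budget_ps(500.0, 30.0, 40.0, 30.0)
    print(budget)                       # 400.0 ps left for logic and wires
    print(to_fo4(budget))               # 20.0 FO4 delays
    print(meets_timing(380.0, budget))  # True
```

With these assumed numbers the budget lands at 20 FO4, the upper end of the 10-20 FO4 range mentioned for high-speed processors.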
Gate Delay
What determines the actual delay of a logic gate?
- Transistors are not perfect switches: they cannot change terminal voltages instantaneously.
- Consider the NAND gate:
  - The value of the current (I) depends on process parameters and transistor size.
  - C_L models the gate output, the wire, and the inputs to the next stage.
  - The load capacitance (C) "integrates" I, creating a voltage change at the output.

More on transistor current
- Transistors act like a cross between a resistor and a current source.
- I_SAT depends on process parameters (higher for nFETs than for pFETs) and on transistor size (layout): I_SAT is proportional to W/L.

More on C_L
Everything that connects to the output of a logic gate (or transistor) contributes capacitance:
- Transistor drains
- Interconnection (wires, contacts, vias)
- Transistor gates

Wires
- So far, wires have been simple capacitors: C is proportional to area = width x length.
- Wires have finite resistance, so they have distributed R and C. With r = resistance/length and c = capacitance/length, the delay is proportional to rcL^2/2, approximately rc + 2rc + 3rc + ...
- For short wires between gates, R is insignificant: the total RC delay is small compared with the gate delay.
- For long wires, R becomes significant. Examples: busses, clocks, reset. Rebuffering helps.

Turning Rise/Fall Delay into Gate Delay
- Cascaded gates.
- [Figure: "transfer curve" for an inverter.]

Driving Large Loads
- Large-fanout nets: clocks, resets, memory bit lines, off-chip.
- A relatively small driver results in a long rise time (and thus a large gate delay).
- Strategy: staged buffers.
- The optimal trade-off between delay per stage and total number of stages is a fanout of about 4-6 per stage.

Components of Path Delay
1. Number of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength

Who controls the delay?
[Table: for each of the six components above, the levers each role controls. Entries recoverable per role:
- Foundry engineer (TSMC): physical parameters
- Library developer (Artisan): cell topology, transistor sizing
- CAD tools (DC, IC Compiler): synthesis, cell selection, place and route
- Designer (Brian Z.): RTL, instantiation, layout generator]

Timing Closure
Searching for and beating down the critical path.
- Synthesis tools work to meet the clock constraint and report delays on paths, and of course simulators can be used to determine timing performance.
- Tools that are expected to do something about the timing behavior (such as synthesizers) also include provisions for specifying input arrival times (relative to the clock) and output requirements (set-up times of the next stage).

Excerpt from IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001:

[Fig. 2. Microprocessor pipeline organization.]

[...] shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
- The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
- Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
- Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re[...]

Timing Analysis, real example
[Figure: the critical path. Axis label: late-mode timing checks (thousands).]
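To make the timing-analysis idea concrete, here is a minimal sketch of a late-mode check: propagate the latest arrival time through a small gate netlist, compare it with the required time at the endpoint, and trace back the critical path. The netlist, gate delays, input arrival times, and required time are all invented for illustration, and the function names are hypothetical. Real tools such as the ones named in the lecture work from library timing models, not a single number per gate.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def arrival_times(fanin, delay, input_arrival):
    """Latest arrival time (ps) at the output of every node.

    fanin:         gate -> list of nodes driving it
    delay:         gate -> gate delay in ps (one number per gate, a simplification)
    input_arrival: primary input -> arrival time relative to the clock, in ps
    """
    arrival = dict(input_arrival)
    # static_order() yields drivers before the gates they feed.
    for node in TopologicalSorter(fanin).static_order():
        if node not in arrival:
            arrival[node] = max(arrival[d] for d in fanin[node]) + delay[node]
    return arrival


def critical_path(fanin, arrival, endpoint):
    """Walk back from the endpoint, always following the latest-arriving driver."""
    path = [endpoint]
    while fanin.get(path[-1]):
        path.append(max(fanin[path[-1]], key=lambda d: arrival[d]))
    return path[::-1]


# Assumed example netlist: inputs a, b, c; gates g1..g4; g4 feeds a flip-flop.
fanin = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"], "g4": ["g3", "c"]}
delay = {"g1": 20, "g2": 35, "g3": 25, "g4": 20}
input_arrival = {"a": 0, "b": 10, "c": 0}
required_ps = 100  # assumed: clock period minus setup time and skew

arrival = arrival_times(fanin, delay, input_arrival)
slack = required_ps - arrival["g4"]
path = critical_path(fanin, arrival, "g4")

if __name__ == "__main__":
    print(arrival["g4"])  # 90
    print(slack)          # 10 (positive slack: the check passes)
    print(path)           # ['b', 'g2', 'g3', 'g4']
```

A negative slack at any endpoint would mark a failing late-mode check, and the traced path is the one to "beat down" first.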