Berkeley COMPSCI 250 - Lecture 5: Physical Realities: Beneath the Digital Abstraction


School name: University of California, Berkeley

Course: COMPSCI 250 - VLSI Systems Design

Pages: 10


CS250, UC Berkeley Fall ’11
Lecture 05, Timing

CS250 VLSI Systems Design
Lecture 5: Physical Realities: Beneath the Digital Abstraction, Part 1: Timing
Fall 2011
Krste Asanović, John Wawrzynek
with John Lazzaro and Brian Zimmer

What do Computer Architects need to know about physics?
‣ Physics effect:
    Area → cost
    Delay → performance
    Energy → performance & cost
• Ideally, zero delay, area, and energy. However, the physical devices occupy area, take time, and consume energy.
• The CMOS process lets us build transistors, wires, and connections, and we get capacitors, inductors, and resistors whether or not we want them.

Physical Layout
‣ The “switch-level” abstraction gives a good way to understand the function of a circuit.
‣ nFET (g=1 ? short circuit : open)
‣ pFET (g=0 ? short circuit : open)
‣ Understanding delay means going below the switch-level abstraction to transistor physics and layout details.

“Gate Delay”
‣ Modern CMOS gate delays are on the order of a few picoseconds. (However, they are highly dependent on gate context.)
‣ Often expressed as FO4 delays (fan-out of 4), a process-independent delay metric:
  ‣ the delay of an inverter, driven by an inverter 4x smaller than itself, and driving an inverter 4x larger than itself.
‣ For our 90nm process, FO4 is around 20ps.

“Path Delay”
‣ For correct operation:
    Total Delay ≤ clock_period - FF_setup_time - FF_clk_to_q - Clock_skew
  on all paths.
‣ High-speed processors' critical paths have around 10-20 FO4 delays.

FO4 Delays per clock period
[Figure: FO4 delays per CPU clock period, 1985-2005. Annotations: historical limit of about 12 FO4 delays; MIPS 2000, 5 stages; Pentium Pro, 10 stages; Pentium 4, 20 stages. Thanks to Francois Labonte, Stanford.]

“Gate Delay”
‣ What determines the actual delay of a logic gate?
‣ Transistors are not perfect switches: they cannot change terminal voltages instantaneously.
‣ Consider the NAND gate:
‣ The current (I) value depends on process parameters and transistor size.
‣ C_L models the gate output, wire, and inputs to the next stage (Cap. of Load).
‣ C “integrates” I, creating a voltage change at the output:
    ∆ ∝ C_L / I

More on transistor Current
‣ Transistors act like a cross between a resistor and a “current source”.
‣ I_SAT depends on process parameters (higher for nFETs than for pFETs) and transistor size (layout):
    I_SAT ∝ W/L

More on C_L
‣ Everything that connects to the output of a logic gate (or transistor) contributes capacitance:
  ‣ Transistor drains
  ‣ Interconnection (wires/contacts/vias)
  ‣ Transistor gates

Wires
‣ So far, simple capacitors:
    C ∝ Area = width ∗ length
‣ Wires have finite resistance, so they have distributed R and C:
    with r = res/length, c = cap/length, ∆ ∝ rcL^2 ≅ rc + 2rc + 3rc + ...
‣ For short wires (between gates) R is insignificant (total RC delay << gate delay).
‣ For long wires R becomes significant. Ex: busses, clocks, reset
‣ “Rebuffering” helps.

Turning Rise/Fall Delay into Gate Delay
• Cascaded gates: “transfer curve” for inverter.

Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ A relatively small driver results in a long rise time (and thus a large gate delay).
‣ Strategy: staged buffers
‣ Optimal trade-off between delay per stage and total number of stages → fanout of ∼4-6 per stage

Components of Path Delay
1. # of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength

Who controls the delay?

                          | Foundry engineer (TSMC) | Library developer (Artisan) | CAD tools (DC, IC Compiler) | Designer (Brian Z)
1. # of levels            |                         |                             | synthesis                   | RTL
2. Internal cell delay    | physical parameters     | cell topology, trans sizing | cell selection              |
3. Wire delay             | physical parameters     |                             | place & route               | layout generator
4. Cell input capacitance | physical parameters     | cell topology, trans sizing | cell selection              | instantiation
5. Cell fanout            |                         |                             | synthesis                   | RTL
6. Cell drive strength    | physical parameters     | transistor sizing           | cell selection              | instantiation

Timing Closure: Searching for and beating down the critical path

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at
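
Returning to the lecture's timing relations: the sketch below works through two of them numerically, the “Path Delay” constraint (Total Delay ≤ clock_period - FF_setup_time - FF_clk_to_q - Clock_skew) and the “Wires” sum rc + 2rc + 3rc + ... for a distributed RC wire. It is a minimal illustration, not a timing tool. All function names and numeric values are hypothetical, except the 20 ps FO4 figure, which is the slides' number for the 90nm process. The rebuffering function additionally assumes ideal buffers with zero delay, which real buffers do not have.

```python
# Minimal sketch of two timing relations from the slides.
# All names and example numbers are hypothetical illustrations.

def logic_delay_budget_ps(clock_period_ps, ff_setup_ps, ff_clk_to_q_ps, clock_skew_ps):
    """Largest path delay allowed by the slide's constraint:
    Total Delay <= clock_period - FF_setup_time - FF_clk_to_q - Clock_skew."""
    return clock_period_ps - ff_setup_ps - ff_clk_to_q_ps - clock_skew_ps


def meets_timing(path_delay_ps, clock_period_ps, ff_setup_ps, ff_clk_to_q_ps, clock_skew_ps):
    """True if a single path satisfies the constraint (it must hold on all paths)."""
    budget = logic_delay_budget_ps(clock_period_ps, ff_setup_ps, ff_clk_to_q_ps, clock_skew_ps)
    return path_delay_ps <= budget


def ps_to_fo4(delay_ps, fo4_ps=20.0):
    """Express a delay in FO4 units; 20 ps is the slides' 90nm figure."""
    return delay_ps / fo4_ps


def distributed_wire_delay(r, c, n_segments):
    """Sum rc + 2rc + 3rc + ... + n*rc for a wire split into n unit segments.
    The closed form is r*c*n*(n+1)/2, which grows roughly as n squared."""
    return sum(k * r * c for k in range(1, n_segments + 1))


def rebuffered_wire_delay(r, c, n_segments, n_pieces):
    """Delay when the wire is cut into equal pieces separated by buffers.
    Assumption: buffers are ideal (zero delay), so only wire delay is counted."""
    if n_segments % n_pieces != 0:
        raise ValueError("n_segments must be divisible by n_pieces")
    return n_pieces * distributed_wire_delay(r, c, n_segments // n_pieces)


if __name__ == "__main__":
    # Hypothetical flip-flop and clock numbers, in picoseconds.
    budget = logic_delay_budget_ps(400, 30, 40, 30)
    print("logic budget (ps):", budget)
    print("logic budget (FO4):", ps_to_fo4(budget))
    print("280 ps path meets timing:", meets_timing(280, 400, 30, 40, 30))

    # Wire delay in arbitrary units (r = c = 1 per segment).
    print("20-segment wire:", distributed_wire_delay(1.0, 1.0, 20))
    print("same wire, one ideal buffer in the middle:", rebuffered_wire_delay(1.0, 1.0, 20, 2))
```

Because the unbuffered sum grows roughly with the square of the wire length, splitting a long wire into shorter buffered pieces reduces the total wire delay in this model, which is the intuition behind the slide's remark that rebuffering helps. A real design would also have to account for each buffer's own delay.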
