1EECS 150 - Components and Design Techniques for Digital SystemsLec 21 – RTL Design Optimization11/16/2004David CullerElectrical Engineering and Computer SciencesUniversity of California, Berkeleyhttp://www.eecs.berkeley.edu/~cullerhttp://www-inst.eecs.berkeley.edu/~cs1502Recall: Register Trans. Level Descriptions• A standard high-level representation for describing systems.• It follows from the fact that all synchronous digital system can be described as a set of state elements connected by combination logic (CL) blocks:• RTL comprises a set of register transfers with optional operators as part of the transfer.• Example:regA ←←←← regBregC ←←←← regA + regBif (start==1) regA ←←←← regC• Personal style:– use “;” to separate transfers that occur on separate cycles.– Use “,” to separate transfers that occur on the same cycle.• Example (2 cycles):regA ←←←← regB, regB ←←←← 0;regC ←←←← regA;reg regCL CLclock inputoutputoption feedbackinputoutput3Components of the data path• Storage– Flip-flops– Registers– Register Files– SRAM• Arithmetic Units– Adders, subtraters, ALUs (built out of FAs or gates)– Comparators– Counters• Interconnect– Wires– Busses– Tri-state Buffers– MUX4Arithmetic Circuit Design (MT revisited)• Full Adder• Adder• Relationship of positional notation and operations on it to arithmetic circuitsFAA B CinCo S5Recall: Computer Organization• Computer design as an application of digital logic design procedures• Computer = processing unit + memory system• Processing unit = control + datapath• Control = finite state machine– Inputs = machine instruction, datapath conditions– Outputs = register transfer control signals, ALU operation codes– Instruction interpretation = instruction fetch, decode, execute• Datapath = functional units + registers + interconnect– Functional units = ALU, multipliers, dividers, etc.– Registers = program counter, shifters, storage registers– Interconenct = busses and wires• Instruction Interpreter vs Fixed Function Device6Datapath vs Control• Datapath: Storage, FU, interconnect sufficient to perform the desired functions– Inputs are Control Points– Outputs are signals• Controller: State machine to orchestrate operation on the data path– Based on desired function and signalsDatapathControllerControl Pointssignals7Vector Op Example (more MT)8List Processor Example• RTL gives us a framework for making high-level optimizations.– Fixed function unit– Approach extends to instruction interpreters• General design procedure outline:1. Problem, Constraints, and Component Library Spec.2. “Algorithm” Selection3. Micro-architecture Specification4. Analysis of Cost, Performance, Power5. Optimizations, Variations6. Detailed Design91. Problem Specification• Design a circuit that forms the sum of all the 2's complements integers stored in a linked-list structure starting at memory address 0:• All integers and pointers are 8-bit. The link-list is stored in a memory block with an 8-bit address port and 8-bit data port, as shown below. The pointer from the last element in the list is 0.At least one node in list.I/Os:– START resets to head of list and starts addition process.– DONE signals completion– R, Bus that holds the final result101. Other Specifications• Design Constraints:– Usually the design specification puts a restriction on cost, performance, power or all. We will leave this unspecified for now and return to it later.• Component Library:component delaysimple logic gates 0.5nsn-bit register clk-to-Q=0.5nssetup=0.5ns (data and LD)n-bit 2-1 multiplexor 1nsn-bit adder (2 log(n) + 2)nsmemory 10ns read (asynchronous read)zero compare 0.5 log(n)(single ported memory)Are these reasonable?112. Algorithm Specification• In this case the memory only allows one access per cycle, so thealgorithm is limited to sequential execution. If in another casemore input data is available at once, then a more parallel solution may be possible.• Assume datapath state registers NEXT and SUM.– NEXT holds a pointer to the node in memory.– SUM holds the result of adding the node values to this point.If (START==1) NEXT0, SUM0;repeat {SUMSUM + Memory[NEXT+1];NEXTMemory[NEXT];} until (NEXT==0);RSUM, DONE1; 12A_SEL01NEXT01+MemoryDA==0+01SUMNEXT_SELLD_NEXTNEXT_ZEROSUM_SELLD_SUM0103. Architecture #1Direct implementation of RTL description:DatapathControllerIf (START==1) NEXT0, SUM0;repeat {SUMSUM + Memory[NEXT+1];NEXTMemory[NEXT];} until (NEXT==0);RSUM, DONE1;134. Analysis of Cost, Performance, and Power• Skip Power for now.• Cost:– How do we measure it? # of transistors? # of gates? # of CLBs?– Depends on implementation technology. Usually we are interested in comparing the relative cost of two competing implementations. (Save this for later)• Performance:– 2 clock cycles per number added.– What is the minimum clock period?– The controller might be on the critical path. Therefore we need to know the implementation, and controller input and output delay.14Possible Controller ImplementationSTARTCOMPSUMGETNEXTDONELD_SUMSUM_SELLD_NEXTNEXT_SELDONEA_SELSTARTSTARTSTARTNEXT_ZERO• Based on this, what is the controller input and output delay?15MT again…• Longest path from any reg out to any reg input164. Analysis of PerformanceCLK-Q MUX8-bit addmemory 15-bit addMUXsetup.5 811010 1.531nsCLKNEXTCLKA_SELMUXcontrol output delaymemoryMUX==0control input delay.5 1011.511.515.5nsCOMPUTE_SUM stateGET_NEXT stateNEXT_ZERO17Critical pathsA_SEL01NEXT01+MemoryDA==0+01SUMNEXT_SELLD_NEXTNEXT_ZEROSUM_SELLD_SUM010184. Analysis of Performance• Detailed timing:clock period (T) = max (clock period for each state)T > 31ns, F < 32 MHz• Observation:COMPUTE_SUM state does most of the work. Most of the componentsare inactive in GET_NEXT state.GET_NEXT does: Memory access + …COMPUTE_SUM does: 8-bit add, memory access, 15-bit add + …• Conclusion:Move one of the adds to GET_NEXT.195. Optimization• Add new register named NUMA, for address of number to add.• Update RTL to reflect our change (note still 2 cycles per iteration):If (START==1) NEXT0, SUM0, NUMA1;repeat {SUMSUM + Memory[NUMA];NUMAMemory[NEXT] + 1,NEXTMemory[NEXT] ;} until
View Full Document