Ramana Kompella Lecture 2: Implementation Models CS 636 Internetworking Spring 2009 1CS 636 Internetworking Spring 2009 Before we can play with improvements, we must know “the rules of the game”. Algorithmics uses four separate areas: protocols, architecture, OS and algorithms Question: How can a logic designer understand protocols and how can an algorithm designer understand hardware ◦ Use simple models that have explanatory and predictive power ◦ Leave details to the specialist 2CS 636 Internetworking Spring 2009 Networking (protocols) Hardware Architecture Operating systems 3CS 636 Internetworking Spring 2009 Interfaces ◦ Used by applications or higher level protocols Packet (message) format ◦ Examples: IP,TCP, UDP Typical protocol operations ◦ Data manipulation ◦ Demultiplexing ◦ Looking up state ◦ Set timers and receive alarms 4CS 636 Internetworking Spring 2009 Throughput: number of jobs per second ◦ Focus of industry, ISPs wish to maximize Latency: time to complete a job ◦ E.g. time for a packet to go from one end to another (RTT) ◦ Ideal interactive time 100ms 5CS 636 Internetworking Spring 2009 Backbone link speeds: ◦ Optical carrier (OC)-n n x 51.8Mbits ◦ OC48 ~ 2.5 Gbps, OC192~10Gbps TCP and web dominance : Web (70%) and TCP (90%) Small transfers: 50% of accessed files < 50Kbytes Latencies large, e.g. 100 ms for wide area ~ half of packets 40 bytes, many 1500 bytes ◦ Takes 8 ns to send 40 bytes at 40 Gbps 6CS 636 Internetworking Spring 2009 Networking (protocols) Hardware Architecture Operating systems 7CS 636 Internetworking Spring 2009 Combinatorial logic ◦ From transistors to gates ◦ Timing and power ◦ High level building blocks (standard cells) Memories ◦ Registers, SRAM, DRAM ◦ Fancy memory tricks (page mode, interleaved) 8CS 636 Internetworking Spring 2009 A function from digital inputs to outputs ◦ Wires, transistors, resistors, capacitors ◦ Gates: AND, NAND, NOT, etc. (60 ps) ◦ More complex blocks: adders, multiplexers Different designs for same boolean function offer different tradeoffs ◦ Time for the output to stabilize ◦ Number of transistors ◦ Power dissipation P=CV2F 9 Input I and outputs O are N-bit vectors such that ◦ O[j] = 1 iff I[j] = 1 and I[k] = 0, for all k<j ◦ Find first one, represent in one-hot notation Design1: Implement using N AND gates where each gate takes all 1 to N inputs. ◦ Space - O(N2) since each AND gate needs O(N) transistors ◦ Time - O(1) CS 636 Internetworking Spring 2009 10 Design 2: Every output bit requires the AND of complment of first j-1 bits ◦ Construct recursively P[j] = P[j-1]. ~I[j] ◦ Using two N two input AND gates in series ◦ O(N) transistors but takes O(N) Design 3: Tradeoff design. Use balanced binary tree to compute P[j] and combine partial results. CS 636 Internetworking Spring 2009 11 Build four input mux from 2 input muxes. CS 636 Internetworking Spring 2009 12 PPE: like a priority encoder but with rotating priority ◦ Find first 1 beyond P Arises in switch arbitration Design 1: ◦ Use a barrel shifter to shift I to the left by P bits. ◦ Apply PE ◦ Shift back by P bits. Two shifts + PE ~ 3 log N gate delays. ◦ Each barrel shift using tree of 2-muxes ~ log (N) CS 636 Internetworking Spring 2009 13 Smarter scheme due to Gupta/McKeown First check if any bit set at or after P If yes, apply PE on bits after P If no, apply PE on bits before P Used in Tiny Tera and Abrizio Tested using TI libraries ◦ 2x fast and 3x less area ◦ For a 32 port router. CS 636 Internetworking Spring 2009 14 Router forwarding done using logic but packets stored in memories Memory are significant bottlenecks ◦ Slower than logic delays Different subsystems require different types ◦ 200ms worth of packet buffering (DRAM) ◦ <1Mbyte of fast SRAM for forwarding tables Need different models CS 636 Internetworking Spring 2009 15 Need to store bit such that it stays indefinitely (absence of writes/power failures) Register is ordered collection of flip-flops Access from logic to a register ~ 0.5-1ns CS 636 Internetworking Spring 2009 16 N registers addressed by log N bits Addresses Self refreshing. Needs a decoder to translate A to unary so add decode delay. On chip SRAM 1-2ns, Off-chip 5-10ns. CS 636 Internetworking Spring 2009 17 SRAM flip-flop needs 5 transistors. DRAM needs one transistor plus capacitor (so more dense) Leakage fixed by refresh. Slower because output not driven by the power supply as in SRAM 40-60ns CS 636 Internetworking Spring 2009 18CS 636 Internetworking Spring 2009 Page mode DRAM (latency 40-60ns) Direct decoding hard: use divide and conquer ◦ Decode in stages, row first and column next ◦ Reduces decoder from O(N) to O(√N) gates Accesses within rows do not require Row address strobe. 19CS 636 Internetworking Spring 2009 Increase DRAM throughput (same latency) While Bank 1 works, send addresses to Bank 2, etc. SDRAM uses 2 banks, RDRAM uses 16 banks. If consecutive accesses to different banks, bandwidth increased by a factor of B 20CS 636 Internetworking Spring 2009 Example: Router with five 10GBps links Overall buffering – 200ms * 50Gbps Want to use DRAM ◦ Bw required 2 * 50 = 100 Gbits/sec + overhead ~ 200 Gbps Use single RDRAM with 16 banks ◦ Peak bw of 1.6 Gbytes or 13Gbps Accessing each RDRAM requires 64 pins for data, 25 for address/control ~90 ◦ 200 Gbps memory requires 16 RDRAMs ~ 1440 pins Chips have pin limtiations (typically <1000pins) Thus requires atleast one more chip. 21 Box architect partitions functions between chips Design teams write specifications, logic designers write RTL using verilog Synthesize gates to generate circuits. Timing checks to change design. Finally chip tapes out, manufactured, yield inspected Easy to add
View Full Document