CS250 VLSI Systems Design, UC Berkeley, Fall 2011
Lecture 10: Patterns for Processing Units and Communication Links
John Wawrzynek, Krste Asanovic, with John Lazzaro and Brian Zimmer (TA)

Unit Transaction Level (UTL)
- A UTL design's functionality is specified as sequences of atomic transactions performed at each unit, each affecting only the local state and I/O of that unit, i.e., serializable: the system can reach any legal state by single-stepping the entire system one transaction at a time.
- A high-level UTL spec admits various mappings into RTL, with various cycle timings and overlap of transaction executions.
[Figure: a system of units connected by a network and a memory, with transactions T1-T5 executing at the units]

Transactional Specification of a Unit
[Figure: a unit containing a scheduler, a set of transactions, and architectural state, connected to the network and memory]
- Each transaction has a combinational guard function, defined over local state and the state of the I/O, indicating when it can fire (e.g., only fire when the head of an input queue is present and of a certain type).
- A transaction mutates local state and performs I/O when it fires.
- The scheduler is a combinational function that picks the next ready transaction to fire.

Architectural State
- The architectural state of a unit is the state that is visible from outside the unit through I/O operations, i.e., architectural state is part of the spec. This is the target for black-box testing.
- When a unit is refined into RTL, there will usually be additional microarchitectural state that is not visible from outside:
  - Intra-transaction sequencing logic
  - Pipeline registers
  - Internal caches and/or buffers
  This is the target for white-box testing.

UTL Example: Route Lookup
[Figure: route-lookup unit with a table access queue, table reply queue, packet input queue, packet output queues, a route lookup table, and a scheduler selecting among Table Write, Table Read, and Route transactions]
Transactions, in decreasing scheduler priority:
- Table Write (request on table access queue): writes a given 12-bit value to a given 12-bit address.
- Table Read (request on table access queue): reads a 12-bit value given a 12-bit address and puts the response on the table reply queue.
- Route (request on packet input queue): looks up the header in the table and places the routed packet on the correct output queue.
This level of detail is all the information we really need to understand what the unit is supposed to do. Everything else is implementation.
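To make the UTL model concrete, here is a minimal behavioral sketch of the route-lookup spec in Python: each transaction is a guard plus an atomic action, and the scheduler fires the highest-priority ready transaction on each step. The queue and field names (`table_access`, `pkt_in`, `header`, and so on), the use of Python deques for queues, and the dict lookup table are illustrative assumptions, not part of the original spec.

```python
from collections import deque

# Behavioral sketch of the route-lookup unit at the UTL level (not RTL).
# Queue names, field names, and the 4-port output are illustrative assumptions.
class RouteLookupUnit:
    def __init__(self):
        self.table = {}                    # architectural state: 12-bit addr -> 12-bit value
        self.table_access = deque()        # input: table read/write requests
        self.table_replies = deque()       # output: table read responses
        self.pkt_in = deque()              # input: packets to route
        self.pkt_out = [deque() for _ in range(4)]  # outputs: one queue per port

    # Guards: combinational predicates over local state and I/O state.
    def can_table_write(self):
        return bool(self.table_access) and self.table_access[0]["op"] == "write"

    def can_table_read(self):
        return bool(self.table_access) and self.table_access[0]["op"] == "read"

    def can_route(self):
        return bool(self.pkt_in)

    # Transactions: each fires atomically, mutating local state and performing I/O.
    def table_write(self):
        req = self.table_access.popleft()
        self.table[req["addr"] & 0xFFF] = req["data"] & 0xFFF   # 12-bit address and value

    def table_read(self):
        req = self.table_access.popleft()
        self.table_replies.append(self.table.get(req["addr"] & 0xFFF, 0))

    def route(self):
        pkt = self.pkt_in.popleft()
        port = self.table.get(pkt["header"], 0) % len(self.pkt_out)
        self.pkt_out[port].append(pkt)

    # Scheduler: fire the highest-priority ready transaction (Table Write > Table Read > Route).
    def step(self):
        for guard, fire in [(self.can_table_write, self.table_write),
                            (self.can_table_read, self.table_read),
                            (self.can_route, self.route)]:
            if guard():
                fire()
                return True
        return False   # no transaction ready this step
```

Single-stepping a system of such units one transaction at a time gives the serializable reference behavior; an RTL implementation is free to pipeline and overlap transactions as long as the externally visible behavior matches.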
Refining Route Lookup to RTL
[Figure: RTL implementation with control logic, a trie lookup pipeline, a lookup RAM, and a reorder buffer feeding the packet output queues]
- The reorder buffer, the trie lookup pipeline's registers, and any control state are microarchitectural state that should not affect the function as viewed from outside.
- The implementation must ensure atomicity of the UTL transactions:
  - The reorder buffer ensures packets flow through the unit in order.
  - Must also ensure a table write doesn't appear to happen in the middle of a packet lookup (e.g., wait for the pipeline to drain before performing the write).

System Design Goal: Rate Balancing
[Figure: processing units connected by an on-chip network, on-chip memory, and off-chip memory]
- System performance is limited by application requirements, on-chip performance, off-chip I/O, or power/energy.
- Want to balance the throughput of all units (processing, memory, networks) so that none is too fast or too slow.

Rate Balancing Patterns
- To make a unit faster, use parallelism:
  - Unrolling for processing units
  - Banking for memories
  - Multiporting for memories
  - Wider links for networks
  - I.e., use more resources by expanding in space and shrinking in time.
- To make a unit slower, use time multiplexing:
  - Replace dedicated links with a shared bus for networks
  - Replace dedicated memories with a common memory
  - Replace a multiport memory with multiple cycles on a single port
  - Multithread computations onto a common pipeline
  - Schedule a dataflow graph onto a single ALU
  - I.e., use fewer resources by shrinking in space and expanding in time.

Stateless Stream Unit Unrolling
- A stream is an ordered sequence.
- Problem: A stateless unit processing a single input stream of requests has insufficient throughput.
- Solution: Replicate the unit and stripe requests across the parallel units. Aggregate the results from the units to form a response stream.
- Applicability: The stream unit does not communicate values between independent requests.
- Consequences: Requires additional hardware for the replicated units plus networks to route requests and collect responses. Latency and energy for each individual request increase due to the additional interconnect cost.
[Figure: timeline comparing a single unit processing T1-T4 serially against a distribute network striping T1-T4 across replicated units and a collect network merging the results]

Variable-Latency Stateless Stream Unit Unrolling
- Problem: A stateless stream unit processing a single input stream of requests has insufficient throughput, and each request takes a variable amount of time to process.
- Solution: Replicate the unit. Allocate space in an output reorder buffer in stream order, then dispatch the request to the next available unit. A unit writes its result to the allocated slot in the output reorder buffer when it completes, possibly out of order, but results can only be removed in stream order.
- Applicability: The stream unit does not communicate values between independent requests.
- Consequences: Additional hardware for the replicated units plus the added scheduler, buffer, and interconnect. Need a scheduler to find the next free unit, and possibly an arbiter for the reorder buffer write ports. Latency and energy for each individual request increase due to the additional buffers and interconnect.
[Figure: variable-latency unrolling with a dispatch scheduler feeding the replicated units, an arbiter on the reorder buffer write ports, and a reorder buffer that releases results in stream order]
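The heart of this pattern is the reorder buffer discipline: allocate in stream order at dispatch, accept completions out of order, and drain strictly in stream order. Below is a minimal Python sketch of that discipline, assuming a simple circular buffer; the depth, slot encoding, and method names are illustrative and not taken from the slides.

```python
# Sketch of the variable-latency unrolling discipline: reorder buffer slots
# are allocated in stream order at dispatch, replicated units may write back
# out of order, and results are drained strictly in stream order.
class ReorderBuffer:
    def __init__(self, depth):
        self.slots = [None] * depth
        self.valid = [False] * depth
        self.head = 0          # next slot to drain (oldest outstanding request)
        self.tail = 0          # next slot to allocate
        self.count = 0

    def allocate(self):
        # Called at dispatch, in stream order; returns the slot tag given to the unit.
        assert self.count < len(self.slots), "reorder buffer full"
        slot = self.tail
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return slot

    def write(self, slot, result):
        # Called by a unit on completion; completions may arrive out of order.
        self.slots[slot] = result
        self.valid[slot] = True

    def drain(self):
        # Results leave only in stream order, stalling at the first missing one.
        out = []
        while self.count and self.valid[self.head]:
            out.append(self.slots[self.head])
            self.valid[self.head] = False
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        return out
```

Because results leave only through drain(), the externally visible stream order matches the single-unit UTL behavior even though the replicated units complete out of order.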
Time Multiplexing
- Problem: Too much hardware is used by several units processing independent transactions.
- Solution: Provide only a single unit and time-multiplex the hardware within the unit to process the independent transactions.
- Applicability: The original units have similar functionality and the required throughput is low.
- Consequences: The combined unit has to provide a superset of the functionality of the original units. The combined unit has to provide architectural state for all the architectural state in the original units.
- Microarchitectural
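As a sketch of this pattern, the snippet below folds two logical units onto one shared ALU with a fixed round-robin schedule, in the spirit of scheduling a dataflow graph onto a single ALU; each logical unit keeps its own register file, reflecting the requirement that the combined unit hold all of the original units' architectural state. The instruction format, operation set, and function names are illustrative assumptions, not from the slides.

```python
# Time multiplexing sketch: two logical units that previously each had a
# dedicated ALU share one ALU, alternating per cycle.
OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def shared_alu(op, a, b):
    return OPS[op](a, b)

def run_time_multiplexed(programs, regs, cycles):
    """programs: one instruction list per logical unit, e.g. [("add", dst, src1, src2), ...];
    regs: one register file (dict) per logical unit, i.e. all original architectural state."""
    pcs = [0] * len(programs)
    for cycle in range(cycles):
        unit = cycle % len(programs)          # fixed round-robin schedule
        prog, rf, pc = programs[unit], regs[unit], pcs[unit]
        if pc < len(prog):
            op, dst, s1, s2 = prog[pc]
            rf[dst] = shared_alu(op, rf[s1], rf[s2])   # one ALU serves all logical units
            pcs[unit] += 1
    return regs

# Example: two one-instruction programs sharing the ALU over four cycles.
regs = run_time_multiplexed(
    programs=[[("add", "r2", "r0", "r1")], [("sub", "r2", "r0", "r1")]],
    regs=[{"r0": 1, "r1": 2, "r2": 0}, {"r0": 5, "r1": 3, "r2": 0}],
    cycles=4)
```

The throughput of each logical unit is halved relative to dedicated hardware, which is exactly the trade the pattern describes: shrinking in space by expanding in time.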