Decoupled Architectures and Transaction-Level Design

Outline:
Today’s Difficult Design Problem
First Complication: Output Stall
Stall Fan-Out Example
Loops Prevent Arbitrary Logic Resizing
Second Complication: Bubbles on Input
Logic to Squeeze Bubbles
Decoupled Design Discipline
Hardware Design Abstraction Levels
Application to RTL in One Step?
Unit-Transaction Level Design Discipline
6.375 UTL Discipline
UTL Overview
Unit Architectural State
Queues
Transactions
Scheduler
UTL Example: IP Lookup
Refining IP Lookup to RTL
UTL & Architectural-Level Verification
UTL Helps Physical Design
Design Template for Unit Microarchitecture
Skid Buffering
Implementing Communication Queues
End-End Credit-Based Flow Control
Distributed Flow Control
Buses
On-Chip Network

6.375 – Spring 2007, L15-Slide-1
Decoupled Architectures and Transaction-Level Design

L15-Slide-2
Today’s Difficult Design Problem
The humble shift register
(For today’s lecture, we’ll assume clock distribution is not an issue)

L15-Slide-3
First Complication: Output Stall
Shift register should only move data to right if output ready to accept next item
What complication does this introduce?
[Figure label: Ready]

L15-Slide-4
Stall Fan-Out Example
200 bits per shift register stage, 16 stages
  – 3200 flip-flops
How many FO4 delays to buffer up ready signal?
This doesn’t include any penalty for driving enable signal wiring!
[Figure labels: Ready, Enable]

L15-Slide-5
Loops Prevent Arbitrary Logic Resizing
We could increase size of gates in ready logic block to reduce fan-out required to drive ready signal to flop enables…
BUT,
[Figure labels: Ready, Ready Logic, Shift Register Module, Receiving Module]

L15-Slide-6
Second Complication: Bubbles on Input
Sender doesn’t have valid data every clock cycle, empty “bubbles” inserted into pipeline
Would like to “squeeze” bubbles out of pipeline
[Figure labels: Ready, Valid, Stage 1, Stage 2, Stage 3, Stage 4, Time, ~Ready, ~Valid]

L15-Slide-7
Logic to Squeeze Bubbles
Can move one stage to right if Ready asserted,
or there is any bubble in stages to right of current stage
Fan-in of number of valid signals grows with number of pipeline stages
Fan-out of each stage’s valid signal also grows with number of pipeline stages
Results in slow combinational paths as number of pipeline stages grows
[Figure labels: Ready?, Valid, Enable?, Valid?]

L15-Slide-8
Decoupled Design Discipline
The shift register example is a simple abstraction that illustrates the control complexity problems of any large synchronous pipeline
  – Usually, there are even more complex interactions between stages
To avoid these problems (and many others), designers will use a decoupled design discipline, where moderate size synchronous units (~10-100K gates) are connected by decoupling FIFOs or channels
[Figure labels: Combinational Logic, CLK, Combinational Logic]

L15-Slide-9
Hardware Design Abstraction Levels
Application
Algorithm
Unit-Transaction Level (UTL) Model
Guarded Atomic Actions (Bluespec)
Register-Transfer Level (Verilog RTL)
Gates
Circuits
Devices
Physics
[Figure label: Today’s Lecture]

L15-Slide-10
Application to RTL in One Step?
Modern hardware systems have complex functionality (graphics chips, video encoders, wireless communication channels), but sometimes designers try to map directly to an RTL cycle-level microarchitecture in one step
Requires detailed cycle-level design of each sub-unit
  – Significant design effort required before it is clear whether design will meet goals
Interactions between units become unclear if arbitrary circuit connections are allowed between units, with possible cycle-level timing dependencies
  – Increases complexity of unit specifications
Removes degrees of freedom for unit designers
  – Reduces possible space for architecture exploration
Difficult to document intended operation, therefore difficult to verify

L15-Slide-11
Unit-Transaction Level Design Discipline
Model design as messages flowing through FIFO buffers between units containing architectural state
Each unit can independently perform an
operation, or transaction, that may consume messages, update local state, and send further messages
Transaction and/or communication might take many cycles (i.e., not necessarily a single Bluespec rule)
  – Have to design RTL of unit microarchitecture during design refinement
[Figure labels: Unit 1, Unit 2, Unit 3, Arch. State (in each unit), Shared Memory Unit]

L15-Slide-12
6.375 UTL Discipline
Various forms of transaction-level model are increasingly used in commercial designs
UTL (Unit-Transaction Level) models are the variant we’ll use in 6.375
UTL forces a discipline on top-level design structure that will result in clean hardware designs that are easier to document and verify, and which should lead to better physical designs
  – A discipline restricts hardware designs, with the goal of avoiding bad choices
UTL specs can be easily implemented in C/C++/Java/SystemC/Bluespec EsePro to give a golden model for design verification
You’re required to give an initial UTL description (in English text) of your project design by the April 6 project milestone

L15-Slide-13
UTL Overview
Unit comprises:
  – Architectural state (registers + RAMs)
  – Input queues and output queues connected to other units
  – Transactions (atomic operations on state and queues)
  – Scheduler (combinational function to pick next transaction to run)
[Figure labels: Unit, Input queues, Output queues, Transactions, Scheduler, Arch. State]

L15-Slide-14
Unit Architectural State
Architectural state is any state that is visible to an external agent
  – i.e., architectural state can be observed by sending strings of packets into input queues and looking at values returned at outputs
High-level specification of a unit only refers to architectural state
Detailed implementation of a unit may have additional microarchitectural state that is not visible externally
  – Intra-transaction sequencing logic
  – Pipeline registers
  – Caches/buffers
[Figure label: Arch. State]

L15-Slide-15
Queues
Queues expose communication latency and decouple units’ execution
Queues are point-to-point channels only
  – No fanout, a unit must replicate messages on multiple queues
  – No buses in a UTL design
Transactions can only pop head of input queues and push at most one element onto each output queue
  – Avoids exposing size of buffers in queues
  – Also avoids synchronization
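
Sketch (not from the slides): the unit structure described above (architectural state, point-to-point queues, transactions, scheduler) can be mocked up in a few lines of Python as a possible starting point for a golden model. Everything named here is an assumption made for illustration: the Unit class, the single integer of architectural state, the "read" message, and the has_room() helper with its bound of 2. Each transaction pops only the head of the input queue and pushes at most one element onto the output queue, and the scheduler simply runs the first transaction whose guard holds.

```python
from collections import deque


class Unit:
    """Hypothetical UTL unit: arch. state, queues, transactions, scheduler."""

    def __init__(self, out_capacity=2):
        self.count = 0                     # architectural state (illustrative)
        self.inq = deque()                 # input queue from one upstream unit
        self.outq = deque()                # output queue to one downstream unit
        self._out_capacity = out_capacity  # assumed bound, hidden behind has_room()

    def has_room(self):
        # Transactions only ask "is there room?", never "how much?".
        return len(self.outq) < self._out_capacity

    # Transaction 1: add a numeric message to the architectural state.
    def can_accumulate(self):
        return bool(self.inq) and self.inq[0] != "read"

    def accumulate(self):
        self.count += self.inq.popleft()   # pops only the head of the input queue

    # Transaction 2: answer a "read" message with the current count.
    def can_report(self):
        return bool(self.inq) and self.inq[0] == "read" and self.has_room()

    def report(self):
        self.inq.popleft()
        self.outq.append(self.count)       # one push onto the output queue

    def step(self):
        """Scheduler: run the first transaction whose guard holds."""
        for guard, action in ((self.can_accumulate, self.accumulate),
                              (self.can_report, self.report)):
            if guard():
                action()
                return True
        return False                       # nothing can fire: the unit stalls


unit = Unit()
unit.inq.extend([3, 4, "read"])
while unit.step():
    pass
print(unit.outq[0])  # prints 7
```

The fixed-priority scheduler is only one choice; a real unit might pick among ready transactions round-robin, and a transaction may take many cycles once refined to RTL.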