The CAE Architecture: Decoupled Program Control for Energy-Efficient Performance
Ronny Krashinsky and Michael Sung

- Motivation for Decoupled Architectures
- Prior Work
- Decoupled Program Control: CAE Architecture
- Decoupled Program Control Benefits
- Issues with Decoupled Program Control
- Progress and Roadmap

The CAE Architecture: Decoupled Program Control for Energy-Efficient Performance

- Change in project direction from the original proposal
  - The initial idea of energy-efficient program control relied heavily on existing ideas
- New idea: combine existing work in decoupled architectures to simplify program control and issue logic for high-performance microprocessors
- Basic idea: program control is inherently separate from the other instruction types (access and execute). We propose decoupling program control from the rest of the instruction stream(s) to uncover ILP

Motivation for Decoupled Architectures

- Increased performance by exploiting fine-grained parallelism between access and execute functions
  - Decoupling access from execution lets the access processor run ahead, or “slip”, with respect to the execution processor: dynamic reordering
  - Memory latency toleration
  - Dynamic loop unrolling (exposes ILP between loop iterations; hides functional unit latencies by overlapping execution of different iterations)
- Simplified issue/decode logic
  - Much simpler than complex superscalar architectures (issue window (IW), reorder buffer (ROB), bypass network)
- Scalability: a direct consequence of simplified logic
  - Superscalar processors need a larger IW, which does not scale (Palacharla/Agarwal papers)
  - Decoupled machines simply lengthen queues to allow more “slip”

Prior Work

- Separation of access and execute functions
  - IBM 360/370, CDC 6600, CDC 7600, CRAY-1
- Explicit partition of access and computation functions
  - J. Smith, PIPE (compile-time splitting)
  - G. Tyson, MISC (Multiple Instruction Stream Computer), descendant of PIPE
  - A. Pleszkun, SMA (Structured Memory Access) architecture
  - J. Smith, Astronautics Corporation’s ZS-1 (splits fixed-point/addressing from floating-point operations)
  - W. A. Wulf, WM architecture
  - IBM’s FOM (FORTRAN Oriented Machine)
- Shares similarities with trace processors
- Technically, superscalar machines have a slightly decoupled nature

Decoupled Program Control: CAE Architecture

- Hierarchy of decoupling: 3 levels (Control, Access, and Execute)
- Control flow is the most elastic, providing ample instructions for both the access and execute pipelines

[Figure: CAE block diagram; labels: MEM, A$, C$, E$, AP, CP, EP]

Decoupled Program Control Benefits

- Program control flow can be determined early, without waiting on the access and execute streams
- Provides “out-of-order” execution without the complexity
  - Inherits the memory latency toleration of DAE architectures
- Simplified issue logic
  - Can be implemented with small structures/queues
- Allows non-speculative instruction prefetching
  - Because of prefetching, structures such as caches can shrink, potentially shortening critical paths and reducing power
  - Still provides a performance advantage for streaming applications and similar workloads, where lack of locality diminishes the benefit of caches

Issues with Decoupled Program Control

- Deadlocking with queues
  - Queue size determines how much slip can occur
  - Draining queues or explicit queue manipulation (push/pop) instructions
- Performance issues from feedback of values from the execute/access pipelines that program control depends on
  - Dependencies limit how far the program control pipeline can slip ahead of the access and execute pipelines
  - Likewise, feedback dependencies from the execute to the access pipeline
- Complexities in queue interactions (correctness, verification, ease of programming)
  - Basically an issue of how to synchronize the instruction streams correctly

Progress and Roadmap

- Completed
  - Literature search of decoupled architectures
  - Initial ISA exploration and microarchitectural development for the proposed CAE architecture
- Roadmap
  - Classifying control flow and dependencies in applications
  - ISA development for each instruction stream (control, access, execute)
  - Complete architectural specification of the CAE architecture (TRS)
  - Implementation of an RTL-level simulator (SyCHOSys)
  - Simulation and performance analysis
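
The relationship between queue size and slip noted in the slides can be illustrated with a toy cycle-level model. This is only a sketch under stated assumptions, not the CAE design or any of the prior-work machines: it models a single access processor feeding a single execute processor through one bounded load queue, with a fixed memory latency, one load issued per cycle, and one value consumed per cycle. The function name `simulate` and its parameters are hypothetical.

```python
from collections import deque


def simulate(num_loads, mem_latency, queue_depth):
    """Toy decoupled access/execute model (illustrative assumption only).

    Returns (total_cycles, max_slip), where slip is the number of loads
    the access side has issued beyond what the execute side has consumed.
    """
    load_queue = deque()  # cycle at which each queued load's data arrives
    issued = 0
    consumed = 0
    max_slip = 0
    cycle = 0
    while consumed < num_loads:
        cycle += 1
        # Execute side: consume the head entry once its data has arrived.
        if load_queue and load_queue[0] <= cycle:
            load_queue.popleft()
            consumed += 1
        # Access side: issue the next load only if a queue slot is free.
        if issued < num_loads and len(load_queue) < queue_depth:
            load_queue.append(cycle + mem_latency)
            issued += 1
        max_slip = max(max_slip, issued - consumed)
    return cycle, max_slip


if __name__ == "__main__":
    for depth in (1, 4, 16):
        cycles, slip = simulate(num_loads=100, mem_latency=10, queue_depth=depth)
        print(f"queue depth {depth:2d}: {cycles} cycles, max slip {slip}")
```

In this model a queue of depth 1 exposes the full memory latency on every load, while a queue at least as deep as the latency lets the access side slip far enough ahead to hide it. Slip never exceeds the queue depth, which is the sense in which queue size determines how much slip can occur.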