Decoupled Architectures for Complexity-Effective General Purpose Processors
Ronny Krashinsky and Mike Sung
6.893 Term Project Presentation
MIT Laboratory for Computer Science
12-7-2000

Motivation
- out-of-order superscalar designs are inefficient and hard to scale
- decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner
- in previous work, decoupled architectures have been investigated for scientific apps
- superscalar architectures are used universally for general purpose computing requirements
  - why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling

Proposal
- use decoupled architectures for complexity-effective general purpose computing
- multithreading can be used to hide loss of decoupling latency
- potentially get the best out of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations
- we will present a survey of prior work and our proposed architectural innovations; unfortunately, a lot of infrastructure (e.g.
a compiler) is required for a more detailed investigation

Decoupled Access/Execute Architecture
Decoupled Access/Execute Computer Architectures, Smith, 1982
- AP & EP process separate instruction streams
  - EP used for computation (floating point)
  - ILP
- data values communicated via queues
- slip: AP runs ahead of EP
  - memory latency hiding
  - dynamic scheduling
- head of AEQ can be used as instruction operand in EP
  - blocks if data isn’t available
  - takes the place of register renaming
- store addresses wait in WAQ until corresponding data arrives from EP
  - loads can bypass stores (check address)

Decoupled Access/Execute Architecture
Decoupled Access/Execute Computer Architectures, Smith, 1982
- program control flow implemented with corresponding conditional branch in each stream
- branch condition queues allow AP to hide branch latency from EP
- loss of decoupling if AP depends on branch condition from EP
  - not discussed in early works
- implemented in the Astronautics ZS-1 Processor
  - single interleaved instruction stream is split to feed instruction queues
  - control flow instruction executed in the splitter

Simultaneous Multithreading with DAE
The Synergy of Multithreading and Access/Execute Decoupling, Parcerisa and Gonzalez, 1998
- observation that functional unit latencies and true data dependencies in EP hinder performance
- use SMT and thread level parallelism to better utilize functional units (same as with SMT in superscalars)
- few threads are required
- decoupling provides memory latency tolerance, SMT hides functional unit latencies

Decoupled Control/Access/Execute Architecture
The Effectiveness of Decoupling, Bird et
al., 1993
- further optimization: control decoupling
- three instruction streams, dynamic slip
- CP processes control flow graph, sends directives to AP and EP to execute basic blocks
- limited control capabilities in AP and EP: loop count and predication
- fetch engines fill queues with valid instructions
  - dynamic loop unrolling
  - control latency hidden (without speculation)
- “stream units”
- CU can operate in stand-alone mode
  - implemented as a 21064, ran the OS

Decoupled Control/Access/Execute Architecture
- loss of decoupling events cause breakdown
The Performance of Decoupled Architectures, Parcerisa et al., 1996

[Figure sequence: Decoupled Control/Access/Execute Architecture, with the annotation “LOD!” (loss of decoupling)]

Decoupled vs. Superscalar Architectures
- Dynamic “out-of-order” execution with less complexity
- Allows non-speculative instruction and data prefetching. We can shrink data structures like first level caches, potentially reducing critical paths as well as reducing power
- Inherent long memory latency toleration, which provides a performance advantage for streaming applications, etc.
where lack of locality diminishes the performance advantages of caches
- Simplified issue logic which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)
- Better resource utilization by partitioning between CP/AP/DP; processors can have specialized ISAs
- Scalability: direct consequence of simplified logic
  - For superscalar processors, need to increase IW which does not scale (Palacharla/Agarwal papers)
  - Decoupled machines alleviate centralized resource bottlenecks
  - Queue-based structure is amenable to tiled architectures with on-chip networks

Decoupled Architectures for General Purpose Computing
- So why haven’t decoupled machines taken over the world? Because superscalar architectures took over the world first
- Primary drawback of decoupled architectures comes from LOD events: “twisty” C code can cause severe performance degradation
- Inability of compilers to program effectively for separate instruction streams: lack of research/development in the area of programming/compiling analysis
- Wheel of Reincarnation: no such thing as a new idea…
- If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can
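
The slip and queue behaviour on the Decoupled Access/Execute Architecture slide (AP runs ahead of EP, EP blocks on the head of the AEQ) can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not the authors' simulator: the function names, the fixed memory latency, the one-load-per-cycle issue rate, and the bounded `queue_depth` are all hypothetical choices made for illustration.

```python
from collections import deque


def coupled_cycles(n_loads, mem_latency):
    """Assumed in-order baseline: every load stalls until its data returns."""
    # mem_latency cycles of waiting plus 1 cycle to consume the value.
    return n_loads * (mem_latency + 1)


def decoupled_cycles(n_loads, mem_latency, queue_depth):
    """Toy model of an AP feeding an EP through a bounded queue (the AEQ).

    Returns (total_cycles, max_slip), where slip is the number of loads
    the AP has issued that the EP has not yet consumed.
    """
    aeq = deque()  # each entry is the cycle at which that load's data arrives
    issued = consumed = 0
    cycle = 0
    max_slip = 0
    while consumed < n_loads:
        # EP: use the head of the queue if its data has arrived, else block.
        if aeq and aeq[0] <= cycle:
            aeq.popleft()
            consumed += 1
        # AP: issue one load per cycle while a queue slot is free.
        if issued < n_loads and len(aeq) < queue_depth:
            aeq.append(cycle + mem_latency)
            issued += 1
        max_slip = max(max_slip, issued - consumed)
        cycle += 1
    return cycle, max_slip


if __name__ == "__main__":
    for depth in (1, 8):
        total, slip = decoupled_cycles(n_loads=4, mem_latency=3, queue_depth=depth)
        print(f"queue depth {depth}: {total} cycles, max slip {slip}")
    print("coupled baseline:", coupled_cycles(n_loads=4, mem_latency=3), "cycles")
```

In this model a queue deeper than the memory latency lets the AP slip far enough ahead that the latency is paid once rather than per load. A depth of 1 keeps the two streams nearly in lockstep, which is roughly what a loss of decoupling looks like.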