MIT 6.893 - Decoupled Architectures

Decoupled Architectures for Complexity-Effective General Purpose Processors
Ronny Krashinsky and Mike Sung
6.893 Term Project Presentation
MIT Laboratory for Computer Science
12-7-2000

Outline: Motivation; Proposal; Decoupled Access/Execute Architecture; Simultaneous Multithreading with DAE; Decoupled Control/Access/Execute Architecture; Decoupled vs. Superscalar Architectures; Decoupled Architectures for General Purpose Computing; Multithreading on a DCAE Architecture; Auxiliary Decoupled Access/Execute Streaming Units; Extensions for Improved Performance; Summary

Motivation
- out-of-order superscalar designs are inefficient and hard to scale
- decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner
- in previous work, decoupled architectures have been investigated for scientific apps
- superscalar architectures are used universally for general purpose computing requirements
- why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling

Proposal
- use decoupled architectures for complexity-effective general purpose computing
- multithreading can be used to hide loss of decoupling latency
- potentially get the best out of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations
- we will present a survey of prior work and our proposed architectural innovations; unfortunately, a lot of infrastructure (e.g. a compiler) is required for a more detailed investigation

Decoupled Access/Execute Architecture
Decoupled Access/Execute Computer Architectures, Smith, 1982
- AP (access processor) & EP (execute processor) process separate instruction streams
- EP used for computation (floating point)
- ILP
- data values communicated via queues
- slip: AP runs ahead of EP
  - memory latency hiding
  - dynamic scheduling
- head of AEQ (access-to-execute queue) can be used as instruction operand in EP
  - blocks if data isn't available
  - takes the place of register renaming
- store addresses wait in WAQ (write address queue) until corresponding data arrives from EP
- loads can bypass stores (check address)
- program control flow implemented with corresponding conditional branch in each stream
- branch condition queues allow AP to hide branch latency from EP
- loss of decoupling if AP depends on branch condition from EP
  - not discussed in early works
- implemented in the Astronautics ZS-1 Processor
  - single interleaved instruction stream is split to feed instruction queues
  - control flow instruction executed in the splitter

Simultaneous Multithreading with DAE
The Synergy of Multithreading and Access/Execute Decoupling, Parcerisa and Gonzalez, 1998
- observation that functional unit latencies and true data dependencies in EP hinder performance
- use SMT and thread level parallelism to better utilize functional units (same as with SMT in superscalars)
- few threads are required
- decoupling provides memory latency tolerance, SMT hides functional unit latencies

Decoupled Control/Access/Execute Architecture
The Effectiveness of Decoupling, Bird et al., 1993
- further optimization: control decoupling
- three instruction streams, dynamic slip
- CP processes control flow graph, sends directives to AP and EP to execute basic blocks
- limited control capabilities in AP and EP: loop count and predication
- fetch engines fill queues with valid instructions
- dynamic loop unrolling
- control latency hidden (without speculation)
- “stream units”
- CU can operate in stand-alone mode
- implemented as a 21064, ran the OS

Decoupled Control/Access/Execute Architecture
The Performance of Decoupled Architectures, Parcerisa et al., 1996
- loss of decoupling events cause breakdown

[A run of slides with the same title follows; their figures are not reproduced in this text. The only surviving text is “LOD!”]

Decoupled vs. Superscalar Architectures
- Dynamic “out-of-order” execution with less complexity
- Allows non-speculative instruction and data prefetching. We can shrink data structures like first level caches, potentially reducing critical paths as well as reducing power
- Inherent long memory latency toleration: provides a performance advantage for streaming applications, etc., where lack of locality diminishes the performance advantages of caches
- Simplified issue logic which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)
- Better resource utilization by partitioning between CP/AP/DP; processors can have specialized ISAs
- Scalability: direct consequence of simplified logic
  - For superscalar processors, need to increase IW, which does not scale (Palacharla/Agarwal papers)
  - Decoupled machines alleviate centralized resource bottlenecks
  - Queue-based structure is amenable to tiled architectures with on-chip networks

Decoupled Architectures for General Purpose Computing
- So why haven’t decoupled machines taken over the world? Because superscalar architectures took over the world first
- Primary drawback of decoupled architectures comes from LOD events: “twisty” C code can cause severe performance degradation
- Inability of compilers to program effectively for separate instruction streams: lack of research/development in the area of programming/compiling analysis
- Wheel of Reincarnation: no such thing as a new idea…
- If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can
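
As a rough illustration of the slip and loss-of-decoupling (LOD) behavior described in these slides, the toy Python model below counts cycles for a stream of loads, each followed by a fixed amount of computation. It is a minimal sketch under stated assumptions, not the authors' simulator: the function names, the one-load-per-item workload, the latency values, and the rule that a queue slot frees only when the EP finishes an item are all hypothetical choices made for illustration.

```python
def coupled_cycles(n_loads, mem_latency, compute):
    """Assumed baseline: in-order and blocking, so every load stalls the pipeline."""
    return n_loads * (mem_latency + compute)


def decoupled_cycles(n_loads, mem_latency, compute, queue_depth, lod_every=0):
    """Cycle count for an AP/EP pair joined by a bounded load-data queue.

    The AP issues at most one load per cycle; data arrives mem_latency
    cycles after issue. The EP consumes items in order and blocks while
    the head item has not arrived. If lod_every > 0, every lod_every-th
    load cannot issue until the EP has finished the previous item
    (a stand-in for a loss-of-decoupling event).
    """
    ep_done = []   # ep_done[i]: cycle at which the EP finishes item i
    ap_free = 0    # earliest cycle at which the AP may issue its next load
    for i in range(n_loads):
        issue = ap_free
        # Bounded queue (assumption): item i reuses the slot of item
        # i - queue_depth, which frees when the EP finishes that item.
        if i >= queue_depth:
            issue = max(issue, ep_done[i - queue_depth])
        # LOD: the AP waits for a value produced by the EP.
        if lod_every and i > 0 and i % lod_every == 0:
            issue = max(issue, ep_done[i - 1])
        arrive = issue + mem_latency
        start = max(arrive, ep_done[i - 1] if i else 0)
        ep_done.append(start + compute)
        ap_free = issue + 1
    return ep_done[-1] if ep_done else 0


if __name__ == "__main__":
    n, lat, comp = 100, 20, 1
    print("coupled:           ", coupled_cycles(n, lat, comp))             # 2100
    print("decoupled:         ", decoupled_cycles(n, lat, comp, 32))       # 120
    print("decoupled, LOD x1: ", decoupled_cycles(n, lat, comp, 32, 1))    # 2100
    print("decoupled, depth 4:", decoupled_cycles(n, lat, comp, 4))
```

In this model, with enough queue depth the AP slips far enough ahead that the memory latency is paid only once, while an LOD event on every item removes all slip and the decoupled machine performs the same as the blocking baseline. A shallow queue lands in between, since the AP can run ahead only as far as the queue allows.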

