Berkeley COMPSCI 258 - Distributed Microarchitectural Protocols - D2973247

School name University of California, Berkeley

Course Compsci 258- Parallel Processors

Appears in the 39th Annual International Symposium on Microarchitecture

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

Karthikeyan Sankaralingam, Ramadass Nagarajan, Robert McDonald, Rajagopalan Desikan†, Saurabh Drolia, M.S. Govindan, Paul Gratz†, Divya Gulati, Heather Hanson†, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Simha Sethumadhavan, Sadia Sharif†, Premkishore Shivakumar, Stephen W. Keckler, Doug Burger

Department of Computer Sciences
†Department of Electrical and Computer Engineering
The University of Texas at Austin
[email protected] www.cs.utexas.edu/users/cart

Abstract

Growing on-chip wire delays will cause many future microarchitectures to be distributed, in which hardware resources within a single processor become nodes on one or more switched micronetworks. Since large processor cores will require multiple clock cycles to traverse, control must be distributed, not centralized. This paper describes the control protocols in the TRIPS processor, a distributed, tiled microarchitecture that supports dynamic execution. It details each of the five types of reused tiles that compose the processor, the control and data networks that connect them, and the distributed microarchitectural protocols that implement instruction fetch, execution, flush, and commit. We also describe the physical design issues that arose when implementing the microarchitecture in a 170M transistor, 130 nm ASIC prototype chip composed of two 16-wide issue distributed processor cores and a distributed 1MB non-uniform (NUCA) on-chip memory system.

1 Introduction

Growing on-chip wire delays, coupled with complexity and power limitations, have placed severe constraints on the issue-width scaling of centralized superscalar architectures. Future wide-issue processors are likely to be tiled [23], meaning composed of multiple replicated, communicating design blocks.
Because of multi-cycle communication delays across these large processors, control must be distributed across the tiles.

For large processors, routing control and data among the tiles can be implemented with microarchitectural networks (or micronets). Micronets provide high-bandwidth, flow-controlled transport for control and/or data in a wire-dominated processor by connecting the multiple tiles, which are clients on one or more micronets. Higher-level microarchitectural protocols direct global control across the micronets and tiles in a manner invisible to software.

In this paper, we describe the tile partitioning, micronet connectivity, and distributed protocols that provide global services in the TRIPS processor, including distributed fetch, execution, flush, and commit. Prior papers have described this approach to exploiting parallelism as well as high-level performance results [15, 3], but have not described the inter-tile connectivity or protocols. Tiled architectures such as RAW [23] use static orchestration to manage global operations, but in a dynamically scheduled, distributed architecture such as TRIPS, hardware protocols are required to provide the necessary functionality across the processor.

To understand the design complexity, timing, area, and performance issues of this dynamic tiled approach, we implemented the TRIPS design in a 170M transistor, 130 nm ASIC chip. This prototype chip contains two processor cores, each of which implements an EDGE instruction set architecture [3], is up to 4-way multithreaded, and can execute a peak of 16 instructions per cycle. Each processor core contains 5 types of tiles communicating across 7 micronets: one for data, one for instructions, and five for control used to orchestrate distributed execution. TRIPS prototype tiles range in size from 1-9 mm².
Four of the principal processor elements (instruction and data caches, register files, and execution units) are each subdivided into replicated copies of their respective tile type; for example, the instruction cache is composed of 5 instruction cache tiles, while the computation core is composed of 16 execution tiles.

The tiles are sized to be small enough so that wire delay within the tile is less than one cycle, and so can largely be ignored from a global perspective. Each tile interacts only with its immediate neighbors through the various micronets, which have roles such as transmitting operands between instructions, distributing instructions from the instruction cache tiles to the execution tiles, or communicating control messages from the program sequencer. By avoiding any global wires or broadcast busses (other than the clock, reset tree, and interrupt signals), this design is inherently scalable to smaller processes, and is less vulnerable to wire delays than conventional designs. Preliminary performance results on the prototype architecture using a cycle-accurate simulator show that compiled code outperforms an Alpha 21264 on half of the benchmarks; and we expect these results to improve as the TRIPS compiler and optimizations are tuned. Hand optimization of the benchmarks produces IPCs ranging from 1.5–6.5 and performance relative to Alpha of 0.6–8.

2 ISA Support for Distributed Execution

Explicit Data Graph Execution (EDGE) architectures were conceived with the goal of high-performance, single-threaded, concurrent but distributed execution, by allowing compiler-generated dataflow graphs to be mapped to an execution substrate by the microarchitecture.
The two defining features of an EDGE ISA are block-atomic execution and direct communication of instructions within a block, which together enable efficient dataflow-like execution.

The TRIPS ISA is an example of an EDGE architecture, which aggregates up to 128 instructions into a single block that obeys the block-atomic execution model, in which a block is logically fetched, executed, and committed as a single entity. This model amortizes the per-instruction bookkeeping over a large number of instructions and reduces the number of branch predictions and register file accesses. Furthermore, this model reduces the frequency at which control decisions about what to execute must be made (such as fetch or commit), providing the additional latency tolerance to make more distributed execution practical.

2.1 Partitioning TRIPS Blocks

The compiler constructs TRIPS blocks and assigns each instruction to a location within the block. Each block is divided into between two and five 128-byte chunks by the microarchitecture. Every block includes a header chunk which encodes up to 32 read and up to 32 write instructions that access the 128
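
As a rough illustration of the chunk arithmetic in Section 2.1, the following Python sketch computes how many 128-byte chunks a block would occupy. The function name `block_layout` and the constant `INSTRUCTIONS_PER_BODY_CHUNK` are hypothetical. The figure of 32 instructions per non-header chunk is an assumption inferred from the stated limits (up to 128 instructions spread over at most four non-header chunks), not something the text states directly.

```python
# Limits taken from the text.
CHUNK_BYTES = 128        # each chunk is 128 bytes
MAX_INSTRUCTIONS = 128   # a block aggregates up to 128 instructions
MAX_READS = 32           # header chunk encodes up to 32 read instructions
MAX_WRITES = 32          # header chunk encodes up to 32 write instructions

# Assumption (not stated in the text): each non-header chunk holds
# 32 instructions, i.e. 128 instructions / 4 non-header chunks.
INSTRUCTIONS_PER_BODY_CHUNK = 32


def block_layout(num_instructions, num_reads=0, num_writes=0):
    """Return the chunk counts and byte size of a hypothetical block."""
    if not 1 <= num_instructions <= MAX_INSTRUCTIONS:
        raise ValueError("a block holds between 1 and 128 instructions")
    if not 0 <= num_reads <= MAX_READS:
        raise ValueError("the header encodes at most 32 read instructions")
    if not 0 <= num_writes <= MAX_WRITES:
        raise ValueError("the header encodes at most 32 write instructions")

    # Round up to whole chunks using integer arithmetic.
    body_chunks = (
        num_instructions + INSTRUCTIONS_PER_BODY_CHUNK - 1
    ) // INSTRUCTIONS_PER_BODY_CHUNK
    total_chunks = 1 + body_chunks  # one header chunk plus the body chunks

    return {
        "header_chunks": 1,
        "body_chunks": body_chunks,
        "total_chunks": total_chunks,
        "total_bytes": total_chunks * CHUNK_BYTES,
    }


if __name__ == "__main__":
    for n in (1, 33, 128):
        print(n, block_layout(n))
```

Under this assumption, the smallest block occupies two chunks and the largest occupies five, which matches the range of two to five chunks given in the text.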
