Princeton ELE 572 - Dynamic Multithreading Processor

A Dynamic Multithreading Processor

Haitham Akkary, Microcomputer Research Labs, Intel Corporation
Michael A. Driscoll, Department of Electrical and Computer Engineering, Portland State University

Abstract

We present an architecture that features dynamic multithreading execution of a single program. Threads are created automatically by hardware at procedure and loop boundaries and executed speculatively on a simultaneous multithreading pipeline. Data prediction is used to alleviate dependency constraints and enable lookahead execution of the threads. A two-level hierarchy significantly enlarges the instruction window. Efficient selective recovery from the second-level instruction window takes place after a mispredicted input to a thread is corrected. The second level is slower to access but has the advantage of large storage capacity. We show several advantages of this architecture: (1) it minimizes the impact of ICache misses and branch mispredictions by fetching and dispatching instructions out-of-order, (2) it uses a novel value prediction and recovery mechanism to reduce artificial data dependencies created by the use of a stack to manage run-time storage, and (3) it improves the execution throughput of a superscalar by 15% without increasing the execution resources or cache bandwidth, and by 30% with one additional ICache fetch port. The speedup was measured on the integer SPEC95 benchmarks, without any compiler support, using a detailed performance simulator.

1 Introduction

Today's out-of-order superscalars use techniques such as register renaming and dynamic scheduling to eliminate hazards created by the reuse of registers, and to hide long execution latencies resulting from DCache misses and floating point operations [1]. However, the basic method of sequential fetch and dispatch of instructions is still the underlying computational model. Consequently, the performance of superscalars is limited by instruction supply disruptions caused by branch mispredictions and ICache misses. On programs where these disruptions occur often, the execution throughput is well below a wide superscalar's peak bandwidth.

Ideally, we need an uninterrupted instruction fetch supply to increase performance. Even then, there are other complexities that have to be overcome to increase execution throughput [2]. Register renaming requires dependency checking among instructions of the same block, and multiple read ports into the rename table. This logic increases in complexity as the width of the rename stage increases. A large pool of instructions is also necessary to find enough independent instructions to run the execution units at full utilization. The issue logic has to identify independent instructions quickly, as soon as their inputs become ready, and issue them to the execution units.

We present an architecture that improves instruction supply and allows instruction windows of thousands of instructions. The architecture uses dynamic multiple threads (DMT) of control to fetch, rename, and dispatch instructions simultaneously from different locations of the same program into the instruction window. In other words, instructions are fetched out-of-order. Fetching using multiple threads has three advantages. First, due to the frequency of branches in many programs, it is easier to increase the instruction supply by fetching multiple small blocks simultaneously than by increasing the size of the fetch block.
Second, when the supply from one thread is interrupted due to an ICache miss or a branch misprediction, the other threads will continue filling the instruction window. Third, although duplication of the ICache fetch port and the rename unit is necessary to increase total fetch bandwidth, dependency checks of instructions within a block and the number of read ports into a rename table entry do not increase in complexity.

In order to enlarge the instruction pool without creating too much complexity in the issue logic, we have designed a hierarchy of instruction windows. One small window is tightly coupled with the execution units. A conventional physical register file or reorder buffer can be used for this level. A much larger set of instruction buffers is located outside the execution pipeline. These buffers are slower to access, but can store many more instructions. The hardware breaks up a program automatically into loop and procedure threads that execute simultaneously on the superscalar processor. Data speculation on the inputs to a thread is used to allow new threads to start execution immediately. Otherwise, a thread may quickly stall waiting for its inputs to be computed by other threads. Although instruction fetch, dispatch, and execution are out of order, instructions are reordered after they complete execution and all mispredictions, including branch and data, are corrected. Results are then committed in order.

1.1 Related work

Many of the concepts in this paper have roots in recent research on multithreading and high performance processor architectures. The potential for achieving a significant increase in throughput on a superscalar by using simultaneous multithreading (SMT) was first demonstrated in [3]. SMT is a technique that allows multiple independent threads or programs to issue multiple instructions to a superscalar's functional units. In SMT all thread contexts are active simultaneously and compete for all execution resources. Separate program counters, rename tables, and retirement mechanisms are provided for the running threads, but caches, instruction queues, the physical register file, and the execution units are simultaneously shared by all threads. SMT has a cost advantage over multiple processors on a single chip due to its capability to dynamically assign execution resources
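To make the thread-creation policy described in Section 1 concrete, the sketch below models a fetch unit that spawns a speculative thread at the return point of each call and at the fall-through of each loop back-edge, the two boundaries named above. It is a minimal behavioral sketch, not the hardware design: the opcode names, the Instr and Thread types, and the four-context limit are illustrative assumptions.

```cpp
// Minimal behavioral sketch of DMT-style thread spawning at procedure and
// loop boundaries. Opcode names, the Instr/Thread types, and the spawn
// policy details are illustrative assumptions, not the paper's hardware.
#include <cstdint>
#include <iostream>
#include <vector>

enum class Op { Call, Return, BackwardBranch, Other };

struct Instr {
    uint64_t pc;
    Op op;
    uint64_t target;  // call target or branch target
};

struct Thread {
    uint64_t start_pc;  // where the speculative thread begins fetching
    bool speculative;   // true until its predicted inputs are validated
};

// Spawn a speculative thread for the code that follows a call or a loop,
// so it can be fetched and executed ahead of time on the SMT pipeline.
void maybe_spawn(const Instr& i, std::vector<Thread>& threads, size_t max_threads) {
    if (threads.size() >= max_threads) return;  // limited hardware contexts
    if (i.op == Op::Call) {
        // Thread starts at the return point (the instruction after the call).
        threads.push_back({i.pc + 4, true});
    } else if (i.op == Op::BackwardBranch) {
        // Loop detected: thread starts at the loop's fall-through (exit) path.
        threads.push_back({i.pc + 4, true});
    }
}

int main() {
    std::vector<Instr> trace = {
        {0x1000, Op::Other, 0},
        {0x1004, Op::Call, 0x2000},            // spawns a thread at 0x1008
        {0x2010, Op::BackwardBranch, 0x2000},  // spawns a thread at 0x2014
        {0x2014, Op::Return, 0},
    };
    std::vector<Thread> threads;
    for (const auto& i : trace) maybe_spawn(i, threads, /*max_threads=*/4);
    for (const auto& t : threads)
        std::cout << "speculative thread at PC 0x" << std::hex << t.start_pc << "\n";
}
```

In the architecture described above, such threads then fetch, rename, and dispatch simultaneously on the SMT pipeline and remain speculative until their predicted inputs are validated.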

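The abstract's value prediction for thread inputs can be illustrated in the same style. The sketch below predicts the stack pointer seen by a post-call thread, on the assumption that a well-behaved callee restores it, and checks the prediction once the call actually completes. The struct and function names are hypothetical, and real recovery would selectively re-execute only the dependent instructions held in the second-level window.

```cpp
// Minimal sketch of value prediction for a speculative thread's live-in
// registers, here the stack pointer across a call. The prediction rule
// (callee restores SP, so the post-call thread sees the caller's SP) follows
// the idea in the text; the names and the recovery hook are illustrative.
#include <cstdint>
#include <iostream>
#include <optional>

struct LiveInPrediction {
    uint64_t predicted_sp;              // value the speculative thread used
    std::optional<uint64_t> actual_sp;  // filled in when the call completes
};

// Predict: a well-behaved callee leaves SP where the caller had it.
LiveInPrediction predict_sp_across_call(uint64_t caller_sp) {
    return {caller_sp, std::nullopt};
}

// Validate once the call completes; on mismatch, selective recovery would
// re-execute only the dependent instructions (modeled here as a flag).
bool needs_selective_recovery(LiveInPrediction& p, uint64_t actual_sp) {
    p.actual_sp = actual_sp;
    return actual_sp != p.predicted_sp;
}

int main() {
    auto p = predict_sp_across_call(0x7fffd000);
    std::cout << (needs_selective_recovery(p, 0x7fffd000) ? "recover\n" : "prediction correct\n");
    std::cout << (needs_selective_recovery(p, 0x7fffcff0) ? "recover\n" : "prediction correct\n");
}
```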

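Finally, a rough model of the two-level instruction window hierarchy: a small window near the execution units spills completed instructions into larger, slower buffers, from which dependent instructions can later be pulled back after a mispredicted thread input is corrected. The capacities, the spill policy, and the recovery interface are assumptions for illustration; the simplified recovery below does not track real dependences.

```cpp
// Minimal sketch of a two-level instruction window: a small window tightly
// coupled to the execution units backed by a larger, slower set of buffers.
// Capacities and the spill/recovery policy are illustrative assumptions.
#include <cstdint>
#include <deque>
#include <iostream>

struct WindowEntry {
    uint64_t pc;
    bool completed;  // completed speculatively; may need recovery later
};

class HierarchicalWindow {
public:
    HierarchicalWindow(size_t l1_cap, size_t l2_cap) : l1_cap_(l1_cap), l2_cap_(l2_cap) {}

    // Dispatch into the small first-level window; spill the oldest completed
    // entry to the second level when the first level is full.
    bool dispatch(WindowEntry e) {
        if (level1_.size() == l1_cap_) {
            if (level2_.size() == l2_cap_ || !level1_.front().completed) return false;
            level2_.push_back(level1_.front());
            level1_.pop_front();
        }
        level1_.push_back(e);
        return true;
    }

    // Selective recovery: after a mispredicted thread input is corrected,
    // pull dependent instructions back from the second level to re-execute.
    // A real design tracks dependences; here we simply drain level 2.
    size_t recover_dependents() {
        size_t n = level2_.size();
        level2_.clear();
        return n;
    }

private:
    size_t l1_cap_, l2_cap_;
    std::deque<WindowEntry> level1_;  // fast, small, near the execution units
    std::deque<WindowEntry> level2_;  // slow, large, outside the pipeline
};

int main() {
    HierarchicalWindow w(/*l1_cap=*/4, /*l2_cap=*/64);
    for (uint64_t pc = 0x1000; pc < 0x1000 + 8 * 4; pc += 4)
        w.dispatch({pc, /*completed=*/true});
    std::cout << "recovered " << w.recover_dependents() << " entries\n";
}
```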