Unformatted text preview:

Copyright 1996 IEEE Published in the Proceedings of the 29th Annual ACM IEEE International Symposium on Microarchitecture Dec 2 4 1996 Paris France Personal use of this material is permitted However permissions to reprint republish this material for advertising or promotional purposes or for creating new collective works for resale or distribution to servers or lists or to reuse any copyrighted component of this work in other works must be obtained from the IEEE Contact Manager Copyright and Permissions IEEE Service Center 445 Hoes Lane P O Box 1331 Piscataway NJ 08855 1331 USA Telephone Intl 908 562 3966 Exceeding the Dataflow Limit via Value Prediction Mikko H Lipasti and John Paul Shen Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA 15213 mhl shen ece cmu edu Abstract For decades the serialization constraints imposed by true data dependences have been regarded as an absolute limit the dataflow limit on the parallel execution of serial programs This paper proposes a new technique value prediction for exceeding that limit that allows data dependent instructions to issue and execute in parallel without violating program semantics This technique is built on the concept of value locality which describes the likelihood of the recurrence of a previously seen value within a storage location inside a computer system Value prediction consists of predicting entire 32 and 64 bit register values based on previously seen values We find that such register values being written by machine instructions are frequently predictable Furthermore we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 that enable value prediction can effectively exploit value locality to collapse true dependences reduce average result latency and provide performance gains of 4 5 23 depending on machine model by exceeding the dataflow limit 1 Motivation and Related Work There are two fundamental restrictions that limit the amount of instruction level parallelism ILP that can be extracted from sequential programs control flow and data flow Control flow limits ILP by imposing serialization constraints at forks and joins in a program s control flow graph 1 Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent i e one needs the result of another to compute its own result and hence must wait for the other to complete before beginning to execute Examining the extent and effect of these limits has been a popular and important area of research particularly in the case of control flow 2 3 4 5 Continuing advances in the development of accurate branch predictors e g 6 have led to increasingly aggressive control speculative microarchitectures e g the Intel Pentium Pro 7 which undertake aggressive measures to overcome controlflow restrictions by using branch prediction and speculative execution to bypass control dependences and expose additional instruction level parallelism to the microarchitecture Meanwhile numerous mechanisms have been proposed and implemented to eliminate false data dependences and tolerate the latencies induced by true data dependences by allowing instructions to execute out of program order see 8 for an overview Surprisingly in light of the extensive energies focused on eliminating control flow restrictions on parallel instruction issue less attention has been paid to eliminating data flow restrictions on parallel issue Recent work has focused primarily on reducing the latency of specific types of instructions usually loads from memory by rearranging pipeline stages 9 10 initiating memory accesses earlier 11 or speculating that dependences to earlier stores do not exist 12 13 14 15 The most relevant prior work in the area of eliminating data flow dependences consists of the Tree Machine 16 17 which uses a value cache to store and look up the results of recurring arithmetic expressions to eliminate redundant computation the value cache in effect performs common subexpression elimination 1 in hardware Richardson follows up on this concept in 18 by introducing the concepts of trivial computation which is defined as the trivialization of potentially complex operations by the occurrence of simple operands and redundant computation where an operation repeatedly performs the same computation because it sees the same operands He proposes a hardware mechanism the result cache which reduces the latency of such trivial or redundant complex arithmetic operations by storing and looking up their results in the result cache In 19 we introduced value locality a concept related to redundant computation and demonstrated a technique Load Value Prediction or LVP for predicting the results of load instructions at dispatch by exploiting the affinity between load instruction addresses and the values the loads produce LVP differs from Harbison s value cache and Richardson s result cache in two important ways first the LVP table is indexed by instruction address and hence value lookups can occur very early in the pipeline second it is speculative in nature and relies on a verification mechanism to guarantee correctness In contrast both Harbison and Richardson use table indices that are only available later in the pipeline Harbison uses data addresses while Richardson uses actual operand values and require their predictions to be correct hence requiring mechanisms for keeping their tables coherent with all other computation In this paper we extend the LVP approach for predicting the results of load instructions to apply to all instructions that write an integer or floating point register show that a significant proportion of such writes are trivially predictable describe a value prediction hardware mechanism that allows dependent instructions to execute in parallel and present results that demonstrate significant performance increases over our baseline machine models 2 Taxonomy of Speculative Execution In order to place our work on value prediction into a meaningful historical context we introduce a taxonomy of speculative execution This taxonomy summarized in Figure 1 categorizes ours as well as previously introduced techniques based on which types of dependences are being bypassed control vs data whether the speculation relates to storage location or value and what type of decision must be made to enable the speculation binary vs multi valued Speculative Execution Control


View Full Document

CMU CS 15740 - Exceeding the Dataflow Limit via Value Prediction

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Exceeding the Dataflow Limit via Value Prediction and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Exceeding the Dataflow Limit via Value Prediction and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?