Exceeding the Dataflow Limit via Value Prediction

Mikko H. Lipasti and John Paul Shen
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
{mhl,shen}@ece.cmu.edu

Abstract

For decades, the serialization constraints imposed by true data dependences have been regarded as an absolute limit--the dataflow limit--on the parallel execution of serial programs. This paper proposes a new technique--value prediction--for exceeding that limit by allowing data-dependent instructions to issue and execute in parallel without violating program semantics. This technique is built on the concept of value locality, which describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values. We find that the register values written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 that enable value prediction can effectively exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% (depending on machine model) by exceeding the dataflow limit.

Copyright 1996 IEEE. Published in the Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, Dec. 2-4, 1996, Paris, France. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or distribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyright and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

1. Motivation and Related Work

There are two fundamental restrictions that limit the amount of instruction-level parallelism (ILP) that can be extracted from sequential programs: control flow and data flow. Control flow limits ILP by imposing serialization constraints at forks and joins in a program's control flow graph [1]. Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent (i.e., one needs the result of the other to compute its own result, and hence must wait for the other to complete before beginning to execute). Examining the extent and effect of these limits has been a popular and important area of research, particularly in the case of control flow [2,3,4,5]. Continuing advances in the development of accurate branch predictors (e.g. [6]) have led to increasingly aggressive control-speculative microarchitectures (e.g. the Intel Pentium Pro [7]), which undertake aggressive measures to overcome control-flow restrictions by using branch prediction and speculative execution to bypass control dependences and expose additional instruction-level parallelism to the microarchitecture. Meanwhile, numerous mechanisms have been proposed and implemented to eliminate false data dependences and tolerate the latencies induced by true data dependences by allowing instructions to execute out of program order (see [8] for an overview).

Surprisingly, in light of the extensive energies focused on eliminating control-flow restrictions on parallel instruction issue, less attention has been paid to eliminating data-flow restrictions on parallel issue. Recent work has focused primarily on reducing the latency of specific types of instructions (usually loads from memory) by rearranging pipeline stages [9, 10], initiating memory accesses earlier [11], or speculating that dependences to earlier stores do not exist [12, 13, 14, 15].

The most relevant prior work in the area of eliminating data-flow dependences is the Tree Machine [16, 17], which uses a value cache to store and look up the results of recurring arithmetic expressions in order to eliminate redundant computation (the value cache, in effect, performs common subexpression elimination [1] in hardware). Richardson follows up on this concept in [18] by introducing the concepts of trivial computation, defined as the trivialization of potentially complex operations by the occurrence of simple operands, and redundant computation, in which an operation repeatedly performs the same computation because it sees the same operands. He proposes a hardware mechanism (the result cache) that reduces the latency of such trivial or redundant complex arithmetic operations by storing and looking up their results in the result cache. In [19], we introduced value locality, a concept related to redundant computation, and demonstrated a technique--Load Value Prediction, or LVP--for predicting the results of load instructions at dispatch by exploiting the affinity between load instruction addresses and the values the loads produce.

LVP differs from Harbison's value cache and Richardson's result cache in two important ways: first, the LVP table is indexed by instruction address, and hence value lookups can occur very early in the pipeline; second, it is speculative in nature, and relies on a verification mechanism to guarantee correctness. In contrast, both Harbison and Richardson use table indices that are only available later in the pipeline (Harbison uses data addresses, while Richardson uses actual operand values), and both require their predictions to be correct, hence requiring mechanisms for keeping their tables coherent with all other computation.

In this paper, we extend the LVP approach for predicting the results of load instructions to apply to all instructions that write an integer or floating-point register; show that a significant proportion of such writes are trivially predictable; describe a value prediction hardware mechanism that allows dependent instructions to execute in parallel; and present results that demonstrate significant performance increases over our baseline machine models.

2. Taxonomy of Speculative Execution

In order to place our work on value prediction into a meaningful historical context, we introduce a taxonomy of speculative execution. This taxonomy, summarized in Figure 1, categorizes ours as well as previously-introduced techniques based on which types of dependences are being bypassed (control vs. data), whether the speculation relates to storage location or value, and