Rice ELEC 525 - Partitioning Register File to Reduce Access Time

Partitioning Register File to Reduce Access Time

Hyong-Youb Kim, Julie Rosser, Kyle Bryson, Supratik Majumder

1 Introduction

In a wide superscalar processor, the time it takes to execute an application depends on the instruction latency and the amount of instruction level parallelism (ILP) that can be extracted from the application. One important factor that influences instruction latency is the number of cycles it takes to access the register file. Whether the architecture is CISC or RISC, practically every instruction accesses the register file for one or more operands. It is therefore hardly surprising that processor architects have, over the years, designed architectures that enable a one-cycle read/write access to the register file. Single-cycle accesses are also preferred because larger access times require deeper pipelines, and deeper pipelines induce more hazards, larger branch penalties, and more complex hardware for hazard detection and data forwarding.

Studies have shown that many of the ILP-increasing techniques employed in wide-issue superscalar processors increase the demand for registers [3]. At the same time, an increased issue width requires additional register ports. Typically, a 4-wide superscalar processor needs at least 8 read ports and 4 write ports on its register file. Both of these effects make the register file not only much bigger but also much slower.

The access time of a register file consists of two distinct components: the wire propagation delay and the fan-in/fan-out delay. Register files typically contain long word-lines and bit-lines, which can take a long time to propagate a signal across their length. For the kind of register file structures considered here, the wire propagation delay is far greater than the fan-in/fan-out delay.
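
The port requirement quoted above for a 4-wide processor is consistent with each issued instruction reading up to two source registers and writing one destination register. A minimal sketch of that arithmetic follows; the function name and the per-instruction operand counts are illustrative assumptions, not something the paper specifies.

```python
def register_ports(issue_width, reads_per_instr=2, writes_per_instr=1):
    """Return (read_ports, write_ports) needed to issue `issue_width`
    instructions per cycle without register-port conflicts.

    Assumption (not from the paper): every instruction may read up to
    `reads_per_instr` registers and write up to `writes_per_instr`.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    return issue_width * reads_per_instr, issue_width * writes_per_instr


if __name__ == "__main__":
    # Port demand grows linearly with issue width under this assumption.
    for width in (1, 2, 4, 8):
        reads, writes = register_ports(width)
        print(f"{width}-wide: {reads} read ports, {writes} write ports")
```

Under these assumptions, `register_ports(4)` gives the 8 read ports and 4 write ports mentioned in the text.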
A bigger register file and an increased number of ports result in a taller register file layout, which translates to longer word-lines and bit-lines [7], thereby increasing wire propagation delay. Moreover, wire delays do not scale with improvements in silicon technology. Thus, as register files grow in size and transistors become faster (smaller feature sizes), the delay problem is only exacerbated.

Over the past decade, researchers have suggested a number of techniques for alleviating the problem of increased wire delay. Whenever a large block of silicon takes up a large fraction of the cycle time, it is common to split the block into smaller and, more importantly, faster pieces [6]. In the past, precious silicon area dictated logic reuse, but these days designers frequently duplicate logic to reduce wire lengths. We believe that both of these ideas can be applied to register files as well.

1.1 Hypothesis

We hypothesize that splitting up the register file would not only reduce its delay but also make it scale better with technology. At the same time, the functional units could be duplicated to limit the wire lengths from the register partitions to them. This effectively provides every partition of the register file with its own set of functional units. Inter-cluster communication would be costly, so this technique is beneficial only if inter-cluster communication is rare. From the register usage patterns obtained from a few benchmark programs, we conclude that this is indeed the case.

The extent of the performance improvement that can be expected from our architecture depends directly on this cluster locality in the instruction stream. It also depends on how many cycles we attribute to accessing a monolithic register file.
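
To make the dependence on cluster locality and on the monolithic access time concrete, here is a rough back-of-the-envelope sketch. It is not the paper's performance model: the formula, the function names, and all numeric values are illustrative assumptions. It charges every instruction the access time of its local partition and adds an inter-cluster penalty only for instructions that need a remote register.

```python
def clustered_access_cycles(local_fraction, partition_cycles, remote_penalty):
    """Average register-access cycles per instruction in a clustered design.

    Assumed model (hypothetical): every instruction pays `partition_cycles`;
    the non-local fraction additionally pays `remote_penalty` cycles.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must lie in [0, 1]")
    return partition_cycles + (1.0 - local_fraction) * remote_penalty


def access_cycle_ratio(monolithic_cycles, local_fraction,
                       partition_cycles, remote_penalty):
    """Ratio of monolithic to clustered average access cycles (>1 favors clustering)."""
    clustered = clustered_access_cycles(local_fraction, partition_cycles,
                                        remote_penalty)
    return monolithic_cycles / clustered


if __name__ == "__main__":
    # All numbers below are placeholders chosen only for illustration.
    for locality in (0.5, 0.7, 0.9):
        ratio = access_cycle_ratio(monolithic_cycles=2, local_fraction=locality,
                                   partition_cycles=1, remote_penalty=2)
        print(f"locality={locality:.1f}: monolithic/clustered = {ratio:.2f}")
```

In this simplified model, clustering only pays off when the local fraction is high enough that the occasional remote penalty does not cancel the faster partition access, which is the trade-off described in the hypothesis.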
We believe that our technique leaves room for significant improvement. Our architecture would also scale much better with process improvements, as the monolithic register file becomes slower.

Figure 1: Instruction breakdown of register accesses for a 2-way split of the register file.

Figure 2: Instruction breakdown of register accesses for a 4-way split of the register file.

Figure 1 shows the register usage pattern of a few SPEC CPU2000 integer benchmarks for a two-way split of the register file. We split the register file such that each half of the registers forms a cluster (registers 0–15 in one cluster and registers 16–31 in the other). We then use the SimpleScalar tool set to obtain the register usage statistics. From the figure, it is apparent that almost 90% of the instructions in these benchmarks show high locality in their register accesses. In other words, almost 90% of the time we are likely to complete the instruction in just one cluster and would not need to pay the inter-cluster communication penalty. Figure 2 shows similar data for a four-way split of the register file in which each cluster has one quarter of the registers.

Based on these preliminary findings, we hypothesize that splitting up the register file and clustering the architecture would indeed be a performance win. Also, as the size of the register file grows and the latency of wires increases, we expect our technique to perform even better.

Figure 3: Overview of the Architecture. (Diagram labels: Decoded and Renamed Instruction; Rename Table; Cluster 0 and Cluster 1, each with Issue Logic, Register File, IFT, RS, and ALU blocks and the annotations Local Instruction and Uses Remote Source; Global ROB; Global Commit Logic.)

In our project we are targeting mainly the integer register file. However, nothing in our architecture prevents it from being applied to the floating point registers as well.
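
The two-way and four-way splits described above can be sketched as a simple mapping from register number to cluster, followed by a count of how many instructions touch only one cluster. This is a minimal illustration, not the SimpleScalar instrumentation the authors used: the trace format (one tuple of register numbers per instruction), the function names, and the decision to treat source and destination registers alike are all assumptions.

```python
NUM_REGS = 32  # integer registers 0..31, matching the split described in the text


def cluster_of(reg, num_clusters):
    """Map a register number to its cluster under an equal, contiguous split."""
    if NUM_REGS % num_clusters != 0:
        raise ValueError("num_clusters must divide the register count evenly")
    if not 0 <= reg < NUM_REGS:
        raise ValueError(f"register {reg} out of range")
    return reg // (NUM_REGS // num_clusters)


def is_local(regs, num_clusters):
    """True if every register the instruction touches lies in one cluster."""
    return len({cluster_of(r, num_clusters) for r in regs}) <= 1


def local_fraction(trace, num_clusters):
    """Fraction of instructions in `trace` that stay within a single cluster."""
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    return sum(1 for regs in trace if is_local(regs, num_clusters)) / len(trace)


if __name__ == "__main__":
    # Hypothetical four-instruction trace; not data from the paper.
    trace = [(1, 2, 3), (1, 9, 3), (16, 17, 18), (4, 20, 5)]
    print("2-way local fraction:", local_fraction(trace, 2))
    print("4-way local fraction:", local_fraction(trace, 4))
```

With a two-way split, registers 0–15 map to cluster 0 and 16–31 to cluster 1; with a four-way split each cluster holds eight registers, so an instruction such as `(1, 9, 3)` is local in the first case but not in the second.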
We expect almost all applications to benefit from our scheme, but we show performance results only for SPEC integer benchmarks.

2 Architecture

Our design is based on a modified SimpleScalar architecture [2]. Unlike the standard SimpleScalar architecture, our design does not contain a register update unit (RUU). Our baseline architecture replaces the RUU with reservation stations, a limited reorder buffer, and a single register file. While functionally similar, the differences in the baseline are important for comparing parallel structures with our proposed architecture. The overall structure of our processor is shown in Figure 3. SimpleScalar parameters for the baseline architecture are shown in Table 2.

Our proposal is to partition elements of the architecture into clusters, splitting some of the resources and duplicating others. Figure 4 depicts the pipeline with detail as to which stages are divided. A single instruction

