Rice ELEC 525 - Managing Interconnect Delay with Architectural and Compiler Techniques

Managing Interconnect Delay With Architectural and Compiler Techniques

Walt Fish, Chris Flesher, David Suksumrit, Allen Wan

Abstract

Interconnect delay is becoming an increasingly dominant constraint in modern processor design. Already, several modern processors require extra pipeline stages to account for interconnect delay, and a signal crossing the entire chip can require several cycles to propagate. Until recently, the interconnect delays between ALUs and the register file were dwarfed by logic delay. However, techniques for managing interconnect delay will become more crucial as technology scaling causes it to account for an increasing portion of functional unit execution time.

We propose to address the problem of interconnect delay through the use of register bank/ALU clusters, created by partitioning the register file into separate banks, each associated with a nearby functional unit. Instructions whose operands are stored in registers adjacent to their intended functional unit do not suffer additional interconnect delay due to long propagation distance, while instructions whose operands are in a separate cluster suffer a longer interconnect delay penalty. We further propose compiler optimizations that keep operands produced and consumed by functional units in registers close to the local ALU cluster whenever possible, so that as few instructions as possible incur the longer inter-cluster communication delay.

I. Introduction

As feature sizes scale down, the delay due to logic in gates improves in proportion to the scaling factor, but the delay due to interconnect decreases only slightly. Gate delay depends on the width of the transistors involved, which is why it tracks feature scaling.
Interconnect delay is modeled by an RC time constant, where R is the resistance of a wire and C is its lumped coupling capacitance with other metal features. As features scale down, C improves by the scaling factor, but making wires smaller increases R by the same factor. As a result, interconnect delay improves poorly relative to the rest of the technology.

As modern superscalar processors push toward faster clock speeds, the portion of the chip that a signal can cross via interconnect in one cycle decreases. With current organizations of the register file and execution units, we will soon reach a point where any operation involving the register file requires an additional clock cycle for signal transfer. The limiting factor is the worst-case scenario, in which operands must be transmitted between registers located at the edges of the register file and an ALU that is physically far away.

As interconnect delay has grown relative to logic delay, circuit designers have had to add extra pipeline stages to allow data to travel between the register file and the ALUs. At 250nm, logic delay equals interconnect delay as a source of latency in circuit design [1]. Requiring an extra cycle for operations or adding another pipeline stage could be seen as a step backward in superscalar processing.

In a 2002 update of the International Technology Roadmap for Semiconductors (ITRS), interconnect delay was identified as an area where “design and layout solutions are needed” [2]. One emerging technique for addressing this hurdle is to expose elements previously hidden by the ISA [3]. Exposing these elements gives compilers and programmers the ability to explicitly account for and manage these obstacles. Our approach extends this concept by proposing a new architecture combined with compiler optimizations that exploit low-level manipulation of system latencies, allowing faster execution time.
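The RC scaling argument made earlier in this section can be illustrated with a small numerical sketch. This is a first-order toy model, not a calibrated one: it assumes a wire of fixed length, normalizes gate and wire delay to 1.0 at 250nm (where the text says they are equal), and applies the scaling factors described above. The function names and the smaller process nodes in the loop are hypothetical choices made only for illustration.

```python
def scale_delays(gate_delay, wire_delay, s):
    """Scale delays for a feature-size factor s (0 < s <= 1).

    Assumed first-order model: gate delay shrinks by s, wire
    capacitance C shrinks by s, and wire resistance R grows by 1/s,
    so the RC product of a fixed-length wire does not improve.
    """
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    r_factor = 1.0 / s   # thinner wire, higher resistance
    c_factor = s         # smaller features, lower capacitance
    return gate_delay * s, wire_delay * r_factor * c_factor


def wire_fraction(gate_delay, wire_delay):
    """Share of total delay that is spent in the interconnect."""
    return wire_delay / (gate_delay + wire_delay)


if __name__ == "__main__":
    gate, wire = 1.0, 1.0  # normalized: equal at 250nm
    for node_nm in (250, 180, 130, 90):  # illustrative nodes only
        g, w = scale_delays(gate, wire, node_nm / 250)
        print(f"{node_nm}nm: gate={g:.2f} wire={w:.2f} "
              f"wire share={wire_fraction(g, w):.0%}")
```

Under these assumptions the wire term stays flat while the gate term shrinks, so the interconnect share of total delay grows with every scaling step, which is the trend the text describes.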
Our paper addresses a hypothetical situation in which interconnect delay has grown to the point where an integer operation takes several cycles. Current superscalar techniques such as Tomasulo’s algorithm [4], reorder buffers, and multiple ALUs will not be able to hide the increased delays without further advances. To prevent this performance degradation, we need to limit the total length of wire on the path of an ALU instruction. To this end, we intend to place ALUs and reservation stations closer to the register file by partitioning the file into smaller pieces, each with its own dedicated ALUs. Data within a cluster can originate from the register file, undergo computation, and be written back to the register file in one cycle. To ensure that as many instructions as possible remain inside one cluster and avoid the inter-cluster communication penalty, we propose a compiler optimization that attempts to reassign source and destination operands, by their location, so that they fall within a single cluster.

Motivation

As interconnect delay has increased relative to logic delay, the time it takes for data to travel from the registers to the ALU has risen to equal or exceed the time required to operate on the data in the ALUs. Since variations in this propagation delay are not exposed, a processor must allow for the worst-case delay on all ALU operations, even if the operands come from the physically closest registers. We propose to separate the register file into banks, each with its associated functional unit, so that only a portion of instructions must suffer the delay of communicating operands from one cluster to another.
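To make the clustered organization and the operand reassignment idea concrete, the following is a minimal sketch. It is not the authors' compiler pass: the greedy heuristic, the function names, and the latency constants (one cycle for a local operation, one extra cycle when operands span clusters) are all assumptions chosen for illustration. Instructions are modeled as tuples of virtual register names (destination, source 1, source 2).

```python
from collections import Counter

LOCAL_LATENCY = 1    # assumed cycles when all operands share a cluster
REMOTE_PENALTY = 1   # assumed extra cycles for inter-cluster traffic


def assign_clusters(program, num_clusters, capacity):
    """Greedily map each virtual register to a cluster.

    Hypothetical heuristic: place new operands in the cluster that
    already holds most of the instruction's operands, breaking ties
    by lowest load. Assumes num_clusters * capacity is at least the
    number of distinct virtual registers in the program.
    """
    home = {}                   # virtual register -> cluster index
    load = [0] * num_clusters   # registers placed in each cluster
    for instr in program:
        votes = Counter(home[v] for v in instr if v in home)
        order = sorted(range(num_clusters),
                       key=lambda c: (-votes[c], load[c], c))
        for v in instr:
            if v in home:
                continue
            # First preferred cluster that still has a free register.
            target = next(c for c in order if load[c] < capacity)
            home[v] = target
            load[target] += 1
    return home


def latency(instr, home):
    """Cycles for one instruction under the assumed latency model."""
    clusters = {home[v] for v in instr}
    return LOCAL_LATENCY + (REMOTE_PENALTY if len(clusters) > 1 else 0)


def total_cycles(program, home):
    """Sum of per-instruction latencies, ignoring overlap and issue width."""
    return sum(latency(instr, home) for instr in program)


if __name__ == "__main__":
    program = [("a", "b", "c"), ("d", "e", "f"), ("g", "a", "d")]
    home = assign_clusters(program, num_clusters=2, capacity=4)
    print(home, total_cycles(program, home))
```

In this toy program the first two instructions each stay within one cluster, while the third reads values produced in both clusters and therefore pays the assumed penalty. A real pass would also have to weigh execution frequency, register pressure, and scheduling, none of which this sketch models.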
When an instruction has both source and destination operands within one bank, it can complete significantly faster in its associated ALU than in an architecture where every operation must allow for the worst-case delay. If an instruction uses operands from different banks, it incurs more delay than the optimal case because of the longer wire paths for inter-bank communication between the registers and the ALUs.

We examined other architectural features to determine whether they had a large effect on the problem of interconnect delay. We analyzed various configurations of the SimpleScalar Register Update Unit (RUU). As can be seen in Figure 1, increasing the RUU size for vpr (a SPECint2000 benchmark) yielded very little performance benefit. Even after increasing the RUU size until the RUU was never full, we

