
Intel Discloses New IA-64 Features
Rotating Registers Reduce Code Expansion; Merced Touted for Big Servers
by Linley Gwennap

In a series of talks at the recent Intel Developers Forum, the company tantalized industry watchers by dribbling out a few more details about its IA-64 instruction set and its first implementation, Merced. In a joint presentation by Intel's John Crawford and Hewlett-Packard's Jerry Huck, the two architects shed additional light on the IA-64 design. They provided further details on the architecture's support for predication and speculation and also described IA-64's branch architecture. A newly disclosed feature, rotating registers, provides an efficient way to unroll loops while minimizing code expansion.

In other talks, Intel disclosed that Merced and its first chip set, the 460GX, will support high-availability features required in large servers. The company asserts that four-processor Merced servers will deliver more performance on the TPC-C benchmark than four-way servers using 1-GHz Alpha 21264 processors or 750-MHz UltraSparc-3 processors, two key Merced rivals that are expected to ship next year. But it has yet to disclose any details about clock speed, bus bandwidth, or other metrics to support this position.

Register Renaming Implemented in Software

One of the key philosophies of IA-64 is the idea of moving complexity from the hardware to the software. Register renaming is one example. Most high-end processors map a small number (8-32) of logical registers onto a larger set of physical registers (up to 80 in the case of the 21264). Because software can access only the logical registers, the hardware must assign mappings and translate accesses using an associative lookup table. This complexity increases die size and, often, the pipeline depth as well.

IA-64 eliminates this hardware complexity with its large register file (128 integer, 128 floating-point) that is directly accessible by software. Specifying the physical register names in software works well except in the
case of tight loops, a common occurrence. In these short code sequences, there may not be enough instructions in the loop to cover the latency of load instructions, resulting in unwanted stalls.

An out-of-order processor reorders instructions to cover the latency of the loads. The reordering naturally overlaps instructions from two or more iterations of the loop until enough instructions are found to overcome the latency or the hardware runs out of resources. This overlap will cause register conflicts, since each loop iteration references the same registers, but these conflicts are resolved by hardware register renaming.

An IA-64 processor can address the latency problem by unrolling the loop in software. This common compiler technique duplicates the loop instructions, often several times, to generate enough instructions to cover the load latencies. Each duplicate set of instructions, however, must use a different set of registers to avoid collisions. IA-64 has plenty of registers available, but all of these duplicate instructions can create massive code expansion.

Rotating Registers Compact Code

To reduce code expansion, IA-64 uses its rotating registers. With this technique, the upper three-quarters of each register file (integer, FP, and predicates) rotates, leaving the lower registers for global variables. Accesses to these upper registers are offset by the value in the corresponding RRB (rotating register base) register. A special instruction, BR.CTOP, decrements each of the RRBs by one at the end of each loop iteration, allowing the next iteration to use a new set of physical registers. With proper spacing, several variables can be rotated through the register file at once.

The rotating predicate registers provide a simple way to handle loop setup (prologue) and termination (epilogue). If the prologue and epilogue instructions are appropriately predicated and the predicate registers rotated, the prologue instructions are executed only during the initial iteration(s) of the loop, and the
epilogue instructions are executed only during the final iteration(s) of the loop. Some setup is still required to properly initialize the predicates, but this can be done well in advance of beginning the loop, removing this setup from the critical path.

    for (i = 0; i < n; i++) *b++ = *a++;

    (a) PA-RISC with hardware reordering

        Set up: r2 = loop count, r10 = source addr, r11 = destination addr

        loop: LDWM  r1, (r10)          Load into r1, inc addr
              STWM  (r11), r1          Store from r1, inc addr
              ADDIB r2, -1, loop       Decr loop count and branch

    (b) IA-64 with rotating registers

        Set up: LC = loop count - 1, r10 = source addr, r11 = destination addr
        Clear predicate registers, set p16, set EC (epilogue count)

        loop: (p16) LD8 r34 = [r10], 8     Load into r34, inc addr
              (p17) ST8 [r11] = r35, 8     Store from previous r34, inc addr
                    BR.CTOP loop           Decr loop count and branch

Figure 1. In a simple memory copy loop, a PA-RISC processor with hardware reordering will cover the latency of the first load by launching subsequent loads, creating multiple versions of r1 using hardware renaming. Without adding instructions to the loop, an IA-64 processor will accomplish the same effect by rotating its registers; in this case, r35 refers to the previous iteration of r34.

MicroDesign Resources, March 8, 1999, Microprocessor Report

Eschewing an orthogonal register set, HP and Intel added several special registers to implement this process. The 64-bit LC (loop count) register performs its eponymous function. The 6-bit EC (epilogue count) register controls the execution of epilogue instructions. Three RRBs, each 6 or 7 bits, rotate the integer, FP, and predicate registers as described above. The use of special registers allows the BR.CTOP instruction to specify several operations at once, but in the common case of nested loops, register rotation can be used in only one of the loops.

This method of register renaming allows a single copy of the loop code to be unrolled in hardware rather than software, eliminating most of the code expansion, as Figure 1 shows. Rotating the
registers adds some complexity (a few 7-bit registers and adders) to the hardware, but it adds far less than the fully generic renaming hardware in a reordering CPU.

The rotating-register concept dates back to Cydrome's Cydra 5, one of the original VLIW processors; not coincidentally, its architect, Bob Rau, is now on staff at HP. By handling epilogue and prologue issues in a simple fashion, IA-64's rotating registers are appropriate even for loops that iterate only a few times. Thus, this technique can be broadly
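
The rotating-register copy loop of Figure 1(b) can be mimicked in a few lines of Python to show how decrementing an RRB makes one iteration's r34 visible as the next iteration's r35, and how the rotating predicates switch the load and store on and off. This is only a rough sketch under several assumptions that are not in the article: the class `RotatingFile` and function `rotating_copy` are hypothetical names, the 64-entry predicate file and the exact BR.CTOP bookkeeping (EC set to the number of pipeline stages, the branch writing p63 just before rotation) are a simplified model, and addresses advance by one list index rather than 8 bytes.

```python
GR_TOTAL, GR_STATIC = 128, 32   # r32..r127 rotate (upper three quarters)
PR_TOTAL, PR_STATIC = 64, 16    # assumed 64 predicates; p16..p63 rotate


class RotatingFile:
    """Register file whose upper three quarters are renamed through an RRB."""

    def __init__(self, total, static, fill):
        self.regs = [fill] * total
        self.static = static
        self.rot_size = total - static
        self.rrb = 0  # rotating register base

    def _phys(self, n):
        if n < self.static:
            return n  # lower registers are never renamed
        return self.static + (n - self.static + self.rrb) % self.rot_size

    def read(self, n):
        return self.regs[self._phys(n)]

    def write(self, n, value):
        self.regs[self._phys(n)] = value

    def rotate(self):
        # Decrementing the RRB makes the old register N appear as N + 1.
        self.rrb = (self.rrb - 1) % self.rot_size


def rotating_copy(src):
    """Copy src with the two-instruction loop; return (dst, loop trips)."""
    dst = [None] * len(src)
    if not src:
        return dst, 0
    gr = RotatingFile(GR_TOTAL, GR_STATIC, 0)
    pr = RotatingFile(PR_TOTAL, PR_STATIC, False)  # predicates cleared
    gr.write(10, 0)        # r10 = source index
    gr.write(11, 0)        # r11 = destination index
    lc = len(src) - 1      # LC = loop count - 1
    ec = 2                 # assumed: EC = pipeline stages (load, store)
    pr.write(16, True)     # set p16 so the first load executes
    trips = 0
    while True:
        trips += 1
        if pr.read(16):    # (p16) LD8 r34 = [r10], 8
            gr.write(34, src[gr.read(10)])
            gr.write(10, gr.read(10) + 1)
        if pr.read(17):    # (p17) ST8 [r11] = r35, 8
            dst[gr.read(11)] = gr.read(35)
            gr.write(11, gr.read(11) + 1)
        # BR.CTOP loop (simplified model)
        if lc > 0:         # kernel: keep starting new iterations
            lc -= 1
            next_stage, taken = True, True
        elif ec > 1:       # epilogue: drain without new loads
            ec -= 1
            next_stage, taken = False, True
        else:              # pipeline empty: fall out of the loop
            ec = 0
            next_stage, taken = False, False
        pr.write(63, next_stage)  # shows up as p16 after rotation
        gr.rotate()
        pr.rotate()
        if not taken:
            return dst, trips
```

Calling `rotating_copy([10, 20, 30])` makes four trips through the loop in this model: the first trip only loads, the middle two load and store, and the last only stores, which is the prologue and epilogue behavior described above obtained from a single copy of the loop body.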

