Demystifying EPIC and IA-64

EPIC Is a Natural Evolution of RISC, Making It Easy to Retrofit Onto RISC

by Peter Song

Microprocessor Report, MicroDesign Resources, January 26, 1998

Using a next-generation architecture technology that Intel and Hewlett-Packard call EPIC (explicitly parallel instruction computing), Merced and future EPIC processors threaten the performance lead held today by RISC processors. EPIC is not entirely new, borrowing many of its ideas from previous RISC and VLIW designs as well as from recent academic research. EPIC has an inherent performance advantage over existing architectures, however, because it is a synergistic assembly of the latest innovations into one architecture. To compete with EPIC processors from Intel, existing RISC architectures are likely to adopt a similar combination of EPIC features in their future versions.

During last year's Microprocessor Forum, Intel and HP gave a high-level, incomplete description of IA-64, for which the companies coined the generic name EPIC (see MPR 10/27/97, p. 1). Nevertheless, we know that EPIC provides a large number of addressable registers, eliminating the need for register renaming and reducing cache accesses. It also provides instruction-dependency hints, simplifying instruction-issue logic. EPIC uses predicated execution to eliminate some branches, thereby increasing scheduling freedom for the compiler, allowing parallel execution of both paths of branches, and reducing opportunities for misprediction. EPIC uses speculative loads to enable well-behaved accesses to memory as soon as the address can be computed, hiding memory latency.

Intel and HP have revealed only a few details of EPIC and IA-64, but we can project more details than publicly disclosed by considering how these EPIC features can be applied to solve today's performance bottlenecks. IA-64 may impose programming restrictions to accommodate clustering of execution units and registers, greatly simplifying hardware without unduly degrading the processor's throughput. It may also use delayed branches to specify branch target addresses as early as possible, reducing reliance on accurate branch prediction. IA-64 may use load/store instructions that also return the effective address as a result, reducing the overhead of hoisting speculative loads above earlier stores.

At first glance, retrofitting these EPIC features onto an existing instruction set seems to require adding more bits, breaking binary compatibility with existing software. While a few new instructions can be added easily to an instruction set using unused opcodes, adding general-purpose registers and predicated execution seems more difficult, or even impossible, without breaking binary compatibility. For many RISC architectures, however, most, if not all, of the known EPIC features can be added without breaking compatibility. EPIC is a natural evolution of RISC; its fixed-length instruction formats and load/store instructions enable the EPIC features to be added easily.

Figure 1. IA-64 processors may group registers and function units into execution clusters, allowing implementations to use smaller crossbars and fewer global wires.

[Figure 1 diagram labels: Inst Fetch Unit; Data Access Unit; clusters 0 through 3 holding GRs 0-31, 32-63, 64-95, and 96-127 with IUs (some with MMX units); clusters 4 through 7 holding FRs 0-31, 32-63, 64-95, and 96-127 with FUs (MAD, sqrt).]

IA-64 Likely to Embrace Clustered Designs

IA-64 has 128 integer and 128 floating-point registers, four times as many registers as a typical RISC architecture, allowing the compiler to expose and express an increased amount of ILP (instruction-level parallelism). Merced and future IA-64 processors are expected to have more execution units than today's high-performance processors, taking advantage of the heightened ILP to deliver better performance.

While additional registers and execution units can improve a processor's throughput, they generally degrade the processor's cycle time, since a crossbar is needed between the registers and the execution units in most general-purpose processors. The crossbar enables the execution units to access any register without interfering with each other and is built into the register file. High-performance designs generally use another crossbar for forwarding results from one execution unit to all units that may need the results, saving one or more cycles required for writing the results to the register file and then reading them.

Adding registers or execution units increases the number of switches and wires in the crossbars, as well as the wire lengths and the capacitive loading, resulting in longer delays through the crossbars. Extra metal layers do not reduce a crossbar's size or its propagation delays, since the switches are built using transistors. Because wire delays take an increasingly large fraction of cycle times as process geometries shrink, a trend that is unlikely to reverse in the foreseeable future, we expect new architectures, including IA-64, to adopt features that require smaller crossbars and fewer global wires.

Clusters Expose Parallelism, Avoid VLIW Flaw

IA-64 is likely to embrace partitioning the processor core (registers and execution units) into clusters at the architectural level, reducing the burden of connecting the plethora of registers and execution units. For example, it could partition the 128 registers into four 32-register banks and restrict most instructions to accessing registers from only one bank. Such a restriction would allow the processor core to be built in clusters, as Figure 1 shows, each consisting of a bank of registers and a set of function units.

Since the crossbars in each cluster connect fewer registers and function units, resulting in fewer register-file ports and result-forwarding paths, they are smaller and have shorter propagation delays than a crossbar connecting all registers and all function units. Using smaller crossbars, the processor core can operate at a higher clock speed
without taking an extra cycle for the function units to forward their results to each other.

There may be paths between the clusters for copying registers from one bank to another, using explicit move instructions. Each path would add a write port to each register and possibly a result-forwarding path to each execution unit. Digital's 21264 (see MPR 10/28/96, p. 11) uses a clustered design in which each pair of integer and address-generation units has its own copy of the integer registers, reducing the
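
The bank restriction projected above (128 registers split into four 32-register banks, with most instructions confined to one bank) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Intel and HP have not disclosed how, or whether, IA-64 encodes such a rule. The function names, and the idea that an instruction executes in the cluster of its destination register, are hypothetical.

```python
from typing import Sequence

# Assumed layout, taken from the article's example: 128 registers
# divided into four banks of 32. Not a disclosed IA-64 detail.
NUM_REGISTERS = 128
BANK_SIZE = 32


def bank_of(reg: int) -> int:
    """Return the index of the bank (cluster) that holds register `reg`."""
    if not 0 <= reg < NUM_REGISTERS:
        raise ValueError(f"register {reg} out of range")
    return reg // BANK_SIZE


def fits_one_cluster(regs: Sequence[int]) -> bool:
    """True if every register named by an instruction lies in one bank."""
    return len({bank_of(r) for r in regs}) <= 1


def moves_needed(dest: int, sources: Sequence[int]) -> int:
    """Count the source registers that sit outside the destination's bank.

    Hypothetical rule: the instruction runs in the destination's cluster,
    so each such source first needs an explicit inter-bank move.
    """
    home = bank_of(dest)
    return sum(1 for s in sources if bank_of(s) != home)
```

Under these assumptions, an instruction naming registers 3, 17, and 30 stays within cluster 0, while one that writes register 5 from registers 40 and 6 would need one explicit move to bring register 40 into bank 0 first.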