MICROPROCESSOR REPORT

Intel's P6 Uses Decoupled Superscalar Design
Next Generation of x86 Integrates L2 Cache in Package with CPU

by Linley Gwennap

Intel's forthcoming P6 processor (see cover story) is designed to outperform all other x86 CPUs by a significant margin. Although it shares some design techniques with competitors such as AMD's K5, NexGen's Nx586, and Cyrix's M1, the new Intel chip has several important advantages over these competitors. The P6's deep pipeline eliminates the cache-access bottlenecks that restrict its competitors to clock speeds of about 100 MHz. The new CPU is designed to run at 133 MHz in its initial 0.5-micron BiCMOS implementation; a 0.35-micron version, due next year, could push the speed as high as 200 MHz.

In addition, the Intel design uses a closely coupled secondary cache to speed memory accesses, a critical issue for high-frequency CPUs. Intel will combine the P6 CPU and a 256K cache chip into a single PGA package, reducing the time needed for data to move from the cache to the processor.

Like some of its competitors, the P6 translates x86 instructions into simple, fixed-length instructions that Intel calls micro-operations, or uops (pronounced "you-ops"). These uops are then executed in a decoupled superscalar core capable of register renaming and out-of-order execution. Intel has given the name "dynamic execution" to this particular combination of features, which is neither new nor unique but highly effective in increasing x86 performance.

The P6 also implements a new system bus with increased bandwidth compared to the Pentium bus. The new bus is capable of supporting up to four P6 processors with no glue logic, reducing the cost of developing and building multiprocessor systems. This feature set makes the new processor particularly attractive for servers; it will also be used in high-end desktop PCs and, eventually, in mainstream PC products.

Not Your Grandfather's Pentium

While Pentium's microarchitecture carries a distinct legacy from the 486, it is hard to find
a trace of Pentium in the P6. The P6 team threw out most of the design techniques used by the 486 and Pentium and started from a blank piece of paper to build a high-performance, x86-compatible processor. The result is a microarchitecture that is quite radical compared with Intel's previous x86 designs, but one that draws from the same bag of tricks as competitors' x86 chips. To this mix, the P6 adds high-performance cache and bus designs that allow even large programs to make good use of the superscalar CPU core.

As Figure 1 (see below) shows, the P6 can be divided into two portions: the in-order and out-of-order sections. Instructions start in order but can be executed out of order. Results flow to the reorder buffer (ROB), which puts them back into the correct order. Like AMD's K5 (see 081401.PDF), the P6 uses the ROB to hold results that are generated by speculative and out-of-order instructions; if it turns out that these instructions should not have been executed, their results can be flushed from the ROB before they are committed.

The performance increase over Pentium comes largely from the out-of-order execution engine. In Pentium, if an instruction takes several cycles to execute, due to a cache miss or other long-latency operation, the entire processor stalls until that instruction can proceed. In the same situation, the P6 will continue to execute subsequent instructions, coming back to the stalled instruction once it is ready to execute. Intel estimates that the P6, by avoiding stalls, delivers 1.5 SPECint92 per MHz, about 40% better than Pentium.

x86 Instructions Translate to Micro-ops

The P6 CPU includes an 8K instruction cache that is similar in structure to Pentium's. On each cycle, it can deliver 16 aligned bytes into the instruction byte queue. Unlike Pentium, the P6 cache cannot fetch an unaligned cache line, throttling the decode process when poorly aligned branch targets are encountered. Any hiccups in the fetch stream, however, are generally
hidden by the deep queues in the execution engine.

The instruction bytes are fed into three instruction decoders. The first decoder, at the front of the queue, can handle any x86 instruction; the others are restricted to only simple (e.g., register-to-register) instructions. Instructions are always decoded in program order, so if an instruction cannot be handled by a restricted decoder, neither that instruction nor any subsequent ones can be decoded on that cycle; the complex instruction will eventually reach the front of the queue and be decoded by the general decoder.

Assuming that instruction bytes are available, at least one x86 instruction will be decoded per cycle, but more than one will be decoded only if the second and third instructions fall into the restricted category. Intel refused to list these instructions, but they do not include any that operate on memory. Thus, the P6's ability to execute more than one x86 instruction per cycle relies on avoiding long sequences of complex instructions or instructions that operate on memory.

Vol. 9, No. 2, February 16, 1995 © 1995 MicroDesign Resources

Figure 1. The P6 combines an in-order front end with a decoupled superscalar execution engine that can process RISC-like micro-ops speculatively and out of order. (The block diagram shows a 32-entry instruction TLB, 8K instruction cache, branch target buffer, instruction fetch unit, two simple decoders and one general decoder, RAT, RRF, uop sequencer, 40-entry reorder buffer, 20-entry reservation station, store-data, store-address, and load-address units, memory reorder buffer (MOB), 64-entry data TLB, integer ALU, integer unit, FP unit, dual-ported 8K data cache, a system bus interface with 36 address and 64 data lines, and a 64-bit L2 cache interface.)

The decoders translate x86 instructions into uops. P6 uops have a fixed length of 118 bits, using a regular structure to encode an operation, two sources, and a destination. The source and destination
fields are each wide enough to contain a 32-bit operand. Like RISC instructions, uops use a load/store model: x86 instructions that operate on memory must be broken into a load uop, an ALU uop, and possibly a store uop. The restricted decoders can produce only one uop per cycle and thus accept only instructions that translate into a single uop. The generalized decoder is capable of generating up to four uops per cycle. Instructions that require more than four uops are handled by a uop sequencer that
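The decode constraints described above (a general decoder that emits up to four uops, plus restricted decoders that accept only single-uop, non-memory instructions, with in-order decode stopping at the first instruction a restricted decoder cannot handle) can be sketched as a toy model. The instruction classification and uop counts here are illustrative assumptions, not Intel's actual tables:

```python
# Toy model of the P6's three-way decode stage: one general decoder
# plus two restricted ("simple") decoders. Instruction classes and
# uop expansions below are illustrative, not Intel's real encodings.

def decode_cycle(queue):
    """Return the uops issued in one decode cycle from the in-order queue."""
    uops = []
    for slot, instr in enumerate(queue[:3]):
        if slot == 0:
            # The general decoder handles any x86 instruction and can
            # emit up to four uops; longer expansions would be handed
            # to the uop sequencer (not modeled here).
            uops.extend(instr["uops"][:4])
        else:
            # Restricted decoders accept only single-uop instructions,
            # none of which operate on memory. In-order decode means
            # later instructions must wait for the next cycle.
            if len(instr["uops"]) > 1 or instr["mem"]:
                break
            uops.extend(instr["uops"])
    return uops

# Three simple register-to-register instructions decode together...
q1 = [{"uops": ["add"], "mem": False},
      {"uops": ["mov"], "mem": False},
      {"uops": ["sub"], "mem": False}]

# ...but a memory-operand instruction in slot 1 stalls the group.
q2 = [{"uops": ["add"], "mem": False},
      {"uops": ["load", "add"], "mem": True},
      {"uops": ["sub"], "mem": False}]

print(decode_cycle(q1))  # all three issue in one cycle
print(decode_cycle(q2))  # only the first issues this cycle
```

In this model, sustained multi-instruction decode depends on instruction ordering, which matches the article's point that compilers must avoid long runs of complex or memory-touching instructions.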
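The reorder buffer's role described earlier, holding out-of-order results and committing them in program order while flushing speculative work that should never have executed, can also be sketched in miniature. The 40-entry capacity comes from the article; the entry fields and method names are invented for illustration:

```python
# Minimal sketch of reorder-buffer (ROB) behavior: uops may finish
# out of order, but results retire to architectural state strictly
# in program order, and speculative results past a mispredicted
# point are flushed before they can commit. Names are illustrative.

class ROBEntry:
    def __init__(self, tag):
        self.tag = tag        # identifies the uop, in program order
        self.done = False     # has the execution unit produced a result?
        self.result = None

class ReorderBuffer:
    def __init__(self, size=40):          # the P6's ROB holds 40 entries
        self.size = size
        self.entries = []

    def allocate(self, tag):
        """Reserve an entry at dispatch; allocation is in program order."""
        assert len(self.entries) < self.size
        self.entries.append(ROBEntry(tag))

    def complete(self, tag, result):
        """Record a result; this may happen out of program order."""
        for e in self.entries:
            if e.tag == tag:
                e.done, e.result = True, result
                return

    def commit(self):
        """Retire finished entries from the head only, in order."""
        retired = []
        while self.entries and self.entries[0].done:
            retired.append(self.entries.pop(0).tag)
        return retired

    def flush_from(self, tag):
        """Discard a mispredicted entry and everything after it."""
        for i, e in enumerate(self.entries):
            if e.tag == tag:
                del self.entries[i:]
                return

rob = ReorderBuffer()
for t in ["i1", "i2", "i3"]:
    rob.allocate(t)
rob.complete("i3", 7)            # i3 finishes early (e.g., i1 missed cache)
assert rob.commit() == []        # head still pending: nothing retires
rob.complete("i1", 1)
assert rob.commit() == ["i1"]    # in-order retirement resumes at the head
rob.flush_from("i2")             # i2 turns out to be mispredicted
assert rob.commit() == []        # i3's speculative result never commits
```

This is the behavior that lets the P6 keep executing past a stalled instruction where Pentium would freeze the whole pipeline.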