Increasing Processor Performance by Implementing Deeper Pipelines

Eric Sprangle, Doug Carmean
Pentium Processor Architecture Group, Intel Corporation
eric.sprangle@intel.com, douglas.m.carmean@intel.com

1063-6897/02 $17.00 © 2002 IEEE

Abstract

One architectural method for increasing processor performance involves increasing the frequency by implementing deeper pipelines. This paper will explore the relationship between performance and pipeline depth using a Pentium 4 processor-like architecture as a baseline and will show that deeper pipelines can continue to increase performance.

This paper will show that the branch misprediction latency is the single largest contributor to performance degradation as pipelines are stretched, and therefore branch prediction and fast branch recovery will continue to increase in importance. We will also show that higher performance cores, implemented with longer pipelines for example, will put more pressure on the memory system, and therefore require larger on-chip caches. Finally, we will show that in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches, can yield performance improvements of 35% to 90% over a Pentium 4-like processor.

1 Introduction

Determining the target frequency of the processor is one of the fundamental decisions facing a microprocessor architect. While historical debate of pushing frequency or IPC to improve performance continues, many argue that modern processors have pushed pipelines beyond their optimal depth. With the fundamental debate raging, most agree that the engineering complexity and effort increases substantially with deeper pipelines. Focusing on single stream performance, and using the Pentium 4 processor as a baseline architecture, this paper will conclude that pipelines can be further lengthened beyond the Pentium 4 processor's 20 stages to improve performance. We assert that architectural advances will enable even deeper pipelines, although engineering effort and other considerations may be the real limiter.

2 Overview

We will propose a model to predict performance as a function of pipeline depth and cache size. First, we will determine the sensitivity of IPC to the depth of important pipelines. Then we will describe how a cycle can be thought of as the sum of "useful time" and "overhead time", and that the frequency can be increased by reducing the amount of "useful time" per cycle. We will then show that deeper pipelines can increase the frequency to more than offset the decrease in IPC. We will then describe how execution time can be thought of as the sum of "core time" and "memory time", and show how memory time can be reduced with larger caches. Finally, we will show how the combination of deeper pipelines and larger caches can increase performance significantly.

3 Fundamental processor loops

Performance can monotonically increase with increased pipeline depth as long as the latency associated with the pipeline is not exposed systematically. Unfortunately, due to the unpredictable nature of code and data streams, the pipeline cannot always be filled correctly, and the flushing of the pipeline exposes the latency. These flushes are inevitable, and pipeline exposures decrease IPC as the pipeline depth increases. For example, a branch misprediction exposes the branch misprediction pipeline, and the exposure penalty increases as the pipeline depth increases. The L1 cache pipeline can also be exposed if there are not enough independent memory operations sent to the L1 cache to saturate the pipeline.

Of course, some pipeline latencies are more important than others. We simulated the performance sensitivities to the various loops on a Pentium 4 processor-like architecture to understand which loops are the most performance sensitive.

4 Simulation methodology

We conducted our experiments using an execution-driven simulator called "Skeleton", which is a high-level simulator that is typically used for coarse-level architectural trade-off analysis. The simulator is layered on top of a uOp-level, IA32 architectural simulator that executes "Long Instruction Traces" (LITs). A LIT is not, as the name implies, a trace; rather, it is a snapshot of processor architectural state that includes the state of system memory. Included in the LIT is a list of "LIT injections", which are system interrupts that are needed to simulate system events such as DMA traffic. Since the LIT includes an entire snapshot of memory, this methodology can execute both user and kernel instructions, as well as wrong path effects. Our simulation methodology uses carefully chosen, 30 million instruction program snippets to model the characteristics of the overall application.

Our simulations are based on a Pentium 4-like processor described in Table 1. The results will be limited to the suites listed in Table 2, for a total of 91 benchmarks that are comprised of 465 LITs.

Table 1: Simulated 2 GHz Pentium 4-like processor configuration

Core:
  3-wide fetch/retire
  2 ALUs running at 2x frequency
  1 load and store / cycle
  In-order allocation/de-allocation of buffers
  512 ROB entries, load buffers and store buffers

Memory system:
  64 kB 8-way I-cache
  8 kB 4-way L1 D-cache, 2 cycle latency
  256 kB 8-way unified L2 cache, 12 cycle latency
  3.2 GB/sec memory system, 165 ns average latency
  Perfect memory disambiguation
  16 kB Gshare branch predictor
  Streaming-based hardware prefetcher

Table 2

  Suite          Number of benchmarks   Description
  SPECint95      8
  Multimedia     22
  Productivity   13
  SPECfp2k       10
  [...]          12                     spec.org
  [...]          14                     CAD, rendering
  Internet       12                     webmark2k, specjbb

It is interesting to note that SPECint95 is much more sensitive to the branch misprediction latency than the other application classes. To a lesser extent, SPECint2k also shows greater sensitivity to branch misprediction latency than the other application classes. In this sense, SPECint95 in particular is not representative of general desktop applications because of the higher branch misprediction rates.
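The link between misprediction rate and latency sensitivity can be illustrated with a first-order model. This is a minimal sketch, not the simulator used in the paper. It assumes each misprediction exposes the full branch misprediction latency and that performance at a fixed frequency is proportional to 1/CPI. The function name, the stall-free CPI of 1.0, and the two misprediction rates are hypothetical values chosen only for illustration.

```python
def relative_performance(stall_free_cpi, mispredicts_per_instr,
                         base_latency, new_latency):
    """Performance at new_latency relative to base_latency (1.0 = unchanged).

    Assumes CPI = stall_free_cpi + mispredicts_per_instr * latency, and that
    performance at a fixed frequency is proportional to 1 / CPI.
    """
    cpi_at_base = stall_free_cpi + mispredicts_per_instr * base_latency
    cpi_at_new = stall_free_cpi + mispredicts_per_instr * new_latency
    return cpi_at_base / cpi_at_new


# Hypothetical workloads: mispredictions per instruction (illustrative only).
WORKLOADS = {
    "branchy (hypothetical)": 0.008,
    "regular (hypothetical)": 0.002,
}

if __name__ == "__main__":
    latencies = range(20, 31, 2)  # misprediction latency in cycles, 20..30
    for name, rate in WORKLOADS.items():
        row = [relative_performance(1.0, rate, 20, lat) for lat in latencies]
        print(name, " ".join(f"{p:.3f}" for p in row))
```

Under these assumptions, the workload with the higher misprediction rate loses more performance for the same increase in latency, which is the qualitative pattern described for SPECint95.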
Figure 1: Normalized performance vs. branch misprediction latency

To understand the sensitivity to the ALU loop latency, we started with a baseline processor that implements half-clock-cycle add operations, like the implementation in the Pentium 4 processor. The Pentium 4 processor pipelines the ALU operation into 3 half cycles: lower 16-bit ALU, upper 16-bit ALU, flag generation [2]. Figure 2 shows the effect of increasing the ALU latency from 1 half clock cycle to 3 full clock cycles while keeping the