CMU CS 15740 - Increasing Processor Performance by Implementing Deeper Pipelines

Increasing Processor Performance by Implementing Deeper Pipelines

Eric Sprangle, Doug Carmean
Pentium® Processor Architecture Group, Intel Corporation
eric.sprangle@intel.com, douglas.m.carmean@intel.com
1063-6897/02 $17.00 © 2002 IEEE

Abstract

One architectural method for increasing processor performance involves increasing the frequency by implementing deeper pipelines. This paper will explore the relationship between performance and pipeline depth using a Pentium® 4 processor like architecture as a baseline and will show that deeper pipelines can continue to increase performance. This paper will show that the branch misprediction latency is the single largest contributor to performance degradation as pipelines are stretched, and therefore branch prediction and fast branch recovery will continue to increase in importance. We will also show that higher performance cores, implemented with longer pipelines for example, will put more pressure on the memory system, and therefore require larger on-chip caches. Finally, we will show that in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches, can yield performance improvements of 35% to 90% over a Pentium® 4 like processor.

1. Introduction

Determining the target frequency of the processor is one of the fundamental decisions facing a microprocessor architect. While the historical debate over pushing frequency or IPC to improve performance continues, many argue that modern processors have pushed pipelines beyond their optimal depth. With the fundamental debate raging, most agree that the engineering complexity and effort increase substantially with deeper pipelines. Focusing on single stream performance, and using the Pentium® 4 processor as a baseline architecture, this paper will conclude that pipelines can be further lengthened beyond the Pentium® 4 processor's 20 stages to improve performance. We assert that architectural advances will enable even deeper pipelines, although engineering effort and other considerations may be the real limiter.

2. Overview

We will propose a model to predict performance as a function of pipeline depth and cache size. First, we will determine the sensitivity of IPC to the depth of important pipelines. Then, we will describe how a cycle can be thought of as the sum of "useful time" and "overhead time", and that the frequency can be increased by reducing the amount of "useful time" per cycle. We will then show that deeper pipelines can increase the frequency to more than offset the decrease in IPC. We will then describe how execution time can be thought of as the sum of "core time" and "memory time" and show how "memory time" can be reduced with larger caches. Finally, we will show how the combination of deeper pipelines and larger caches can increase performance significantly.

3. Fundamental processor loops

Performance can monotonically increase with increased pipeline depth as long as the latency associated with the pipeline is not exposed systematically. Unfortunately, due to the unpredictable nature of code and data streams, the pipeline cannot always be filled correctly and the flushing of the pipeline exposes the latency. These flushes are inevitable, and pipeline exposures decrease IPC as the pipeline depth increases. For example, a branch misprediction exposes the branch misprediction pipeline, and the exposure penalty increases as the pipeline depth increases. The L1 cache pipeline can also be exposed if there are not enough independent memory operations sent to the L1 cache to saturate the pipeline. Of course, some pipeline latencies are more important than others. We simulated the performance sensitivities to the various loops on a Pentium® 4 processor like architecture to understand which loops are the most performance sensitive.

4. Simulation methodology

We conducted our experiments using an execution driven simulator called "Skeleton", which is a high level simulator that is typically used for coarse level architectural trade-off analysis. The simulator is layered on top of a uOp-level, IA32 architectural simulator that executes "Long Instruction Traces (LITs)". A LIT is not, as the name implies, a trace; rather, it is a snapshot of processor architectural state that includes the state of system memory. Included in the LIT is a list of "LIT injections", which are system interrupts that are needed to simulate system events such as DMA traffic. Since the LIT includes an entire snapshot of memory, this methodology can execute both user and kernel instructions, as well as wrong path effects. Our simulation methodology uses carefully chosen, 30 million instruction program snippets to model the characteristics of the overall application. Our simulations are based on a Pentium® 4 like processor described in Table 1. The results will be limited to the suites listed in Table 2, for a total of 91 benchmarks composed of 465 LITs.

Table 1: Simulated 2GHz Pentium® 4 like processor configuration.

Core
    3-wide fetch/retire
    2 ALUs (running at 2x frequency)
    1 load and store / cycle
    In-order allocation/de-allocation of buffers
    512 rob entries, load buffers and store buffers
Memory System
    64 kB/8-way I-cache
    8 kB/4-way L1 D-cache, 2 cycle latency
    256 kB/8-way unified L2 cache, 12 cycle latency
    3.2 GB/sec memory system, 165ns average latency
    Perfect memory disambiguation
    16 kB Gshare branch predictor
    Streaming based hardware prefetcher

Table 2: Simulated Benchmark Suites

Suite          Number of Benchmarks   Description
SPECint95      8                      spec.org
Multimedia     22                     speech recognition, mpeg, photoshop, ray tracing, rsa
Productivity   13                     sysmark2k internet/business/productivity, Premiere
SPECfp2k       10                     spec.org
SPECint2k      12                     spec.org
Workstation    14                     CAD, rendering
Internet       12                     webmark2k, specjbb

5. Efficiency vs. pipeline depth

Figure 1 shows the relative IPC as the branch misprediction penalty is increased from 20 to 30 cycles. We can determine the average branch misprediction latency sensitivity by
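
The trade-off outlined in the Overview, a cycle as the sum of "useful time" and "overhead time" weighed against the IPC lost to a longer misprediction pipeline, can be illustrated with a minimal Python sketch. The 20-stage depth and the 2 GHz (500 ps) baseline come from the text; the 90 ps per-cycle overhead, the 0.45% IPC loss per added stage, the linear IPC model, and all function names are assumptions made only for illustration, not values or methods taken from the text above.

```python
# Sketch of "frequency gain vs. IPC loss" as pipeline depth grows.
# BASE_STAGES and BASE_CYCLE_PS come from the text (20 stages, 2 GHz).
# OVERHEAD_PS and IPC_LOSS_PER_STAGE are ASSUMED placeholder values.

BASE_STAGES = 20            # Pentium 4 like baseline pipeline depth
BASE_CYCLE_PS = 500.0       # 2 GHz baseline cycle time in picoseconds
OVERHEAD_PS = 90.0          # ASSUMPTION: latch/skew/jitter overhead per cycle
IPC_LOSS_PER_STAGE = 0.0045 # ASSUMPTION: fractional IPC loss per added stage


def cycle_time_ps(stages: int) -> float:
    """Cycle time = useful time per stage + fixed overhead per cycle."""
    total_useful_ps = (BASE_CYCLE_PS - OVERHEAD_PS) * BASE_STAGES
    return total_useful_ps / stages + OVERHEAD_PS


def relative_frequency(stages: int) -> float:
    """Frequency relative to the 20-stage baseline."""
    return BASE_CYCLE_PS / cycle_time_ps(stages)


def relative_ipc(stages: int) -> float:
    """ASSUMED linear IPC degradation as the exposed pipeline lengthens."""
    return max(0.0, 1.0 - IPC_LOSS_PER_STAGE * (stages - BASE_STAGES))


def relative_performance(stages: int) -> float:
    """Performance is proportional to frequency times IPC."""
    return relative_frequency(stages) * relative_ipc(stages)


if __name__ == "__main__":
    for depth in (20, 30, 40, 50, 60):
        print(depth,
              round(relative_frequency(depth), 3),
              round(relative_ipc(depth), 3),
              round(relative_performance(depth), 3))
```

Because the overhead term does not shrink with depth, doubling the number of stages in this sketch raises the frequency by less than 2x, which is why the IPC sensitivity matters when judging whether a deeper pipeline still pays off.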

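The Overview's split of execution time into "core time" and "memory time" can be sketched in the same way. The assumption here is that core time shrinks with the core speedup while memory time is fixed in absolute terms unless a larger cache reduces it; the `memory_time_scale` parameter and the example fractions are hypothetical and are not derived from the paper's data.

```python
# Sketch: execution time = core time + memory time.
# A faster core only shrinks the core component; a larger cache is modeled
# by an ASSUMED scale factor applied to the memory component.


def overall_speedup(core_fraction: float,
                    core_speedup: float,
                    memory_time_scale: float = 1.0) -> float:
    """Speedup over a baseline whose execution time is normalized to 1.

    core_fraction     -- share of baseline time spent in the core (0..1)
    core_speedup      -- frequency x IPC gain of the new core
    memory_time_scale -- HYPOTHETICAL factor (< 1 means a larger cache
                         removed some memory time)
    """
    core_time = core_fraction / core_speedup
    memory_time = (1.0 - core_fraction) * memory_time_scale
    return 1.0 / (core_time + memory_time)


if __name__ == "__main__":
    # Same core speedup, with and without an assumed halving of memory time.
    print(overall_speedup(0.8, 1.5))
    print(overall_speedup(0.8, 1.5, memory_time_scale=0.5))
```

In this sketch a 1.5x faster core yields less than a 1.5x overall speedup whenever any time is spent waiting on memory, which matches the text's point that higher performance cores put more pressure on the memory system.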
