Intel Pentium 4: A Detailed Description
By Allis Kennedy, Anna McGary
For CPE 631, Dr. Milenkovic, Spring 2004

Intel Pentium 4 Outline
  - P4 General Introduction
  - Chip Layout
  - Micro-Architecture: NetBurst
  - Memory Subsystem
  - Cache Hierarchy
  - Branch Prediction
  - Pipeline
  - Hyper-Threading
  - Conclusions

Pentium 4 General Introduction

Intel Pentium 4 Introduction
  - The Pentium 4 processor is Intel's new microprocessor, introduced in November 2000
  - The Pentium 4 processor:
      - Has 42 million transistors implemented on Intel's 0.18 µm CMOS process, with six levels of aluminum interconnect
      - Has a die size of 217 mm²
      - Consumes 55 watts of power at 1.5 GHz
      - Has a 3.2 GB/second system bus that helps provide the high data bandwidths needed to supply data for demanding applications
      - Implements the new Intel NetBurst microarchitecture

Intel Pentium 4 Introduction (cont'd)
  - The Pentium 4:
      - Extends the Single Instruction Multiple Data (SIMD) computational model with Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3), which improve performance for multimedia, content creation, scientific, and engineering applications
      - Supports Hyper-Threading (HT) Technology
      - Has a deeper pipeline (20 pipeline stages)

Pentium 4 Chip Layout

Pentium 4 Chip Layout
  - 400 MHz System Bus
  - Advanced Transfer Cache
  - Hyper Pipelined Technology
  - Enhanced Floating Point / Multi-Media
  - Execution Trace Cache
  - Rapid Execution Engine
  - Advanced Dynamic Execution

400 MHz System Bus
  - Quad pumped
  - On every latch, four addresses from the L2 cache are decoded into µops (micro-operations) and stored in the trace cache
  - A 100 MHz system bus yields 400 MHz data transfers into and out of the processor
  - A 200 MHz system bus yields 800 MHz data transfers into and out of the processor
  - Overall, the P4 has a data rate of 3.2 GB/s into and out of the processor, compared with 1.06 GB/s for the PIII's 133 MHz system bus

[Figure: 400 MHz System Bus (Ref 6)]

Advanced Transfer Cache
  - Handles the first 5 stages of the Hyper Pipeline
  - Located on the die with the processor core
  - Includes data pre-fetching
  - 256-bit interface that transfers data on each core clock
  - 256 KB unified L2 cache (instruction + data)
  - 8-way set associative
  - 128-byte cache line, split into two 64-byte pieces (reads 64 bytes in one go)
  - For a 1.4 GHz P4, the data bandwidth between the ATC and the core is 44.8 GB/s

[Figure: Advanced Transfer Cache (Ref 6)]

Hyper Pipelined Technology
  - Deep 20-stage pipeline
  - Allows signals to propagate quickly through the circuits
  - Allows 126 in-flight instructions
  - Up to 48 load and 24 store instructions in flight at one time
  - However, if a branch is mispredicted, it takes a long time to refill the pipeline and continue execution
  - The improved Trace Cache branch prediction unit is supposed to make pipeline flushes rare

[Figure: Hyper Pipelined Technology (Ref 6)]

Enhanced Floating Point / Multi-Media
  - Extended instruction set of 144 new instructions
  - Designed to enhance Internet and computing applications
  - New instruction types:
      - 128-bit SIMD integer arithmetic operations
          - 64-bit MMX technology
          - Accelerates video, speech, encryption, imaging, and photo processing
      - 128-bit SIMD double-precision floating-point operations
          - Accelerates 3D rendering, financial calculations, and scientific applications

[Figure: Enhanced Floating Point / Multi-Media (Ref 6)]

Execution Trace Cache
  - Basically, the execution trace cache is an L1 instruction cache that lies directly behind the decoders
  - Holds the µops for the most recently decoded instructions
  - Integrates results of branches in the code into the same cache line
  - Stores decoded IA-32 instructions
  - Removes the latency associated with the CISC decoder from the main execution loops

[Figure: Execution Trace Cache (Ref 6)]

Rapid Execution Engine
  - Execution core of the NetBurst microarchitecture
  - Facilitates parallel execution of the µops by using:
      - 2 double-pumped ALUs and AGUs
          - Double-pumped ALUs handle simple instructions
          - Double-pumped AGUs (Address Generation Units) handle loading/storing of addresses
          - Clocked at double the processor's clock
          - Can receive a µop every half clock
      - 1 "slow" ALU (not double pumped)
      - 1 MMX and 1 SSE unit
          - The PIII had two of each; Intel claims the additional units did not improve SSE/SSE2, MMX, or FPU performance

[Figure: Rapid Execution Engine (Ref 6)]

Advanced Dynamic Execution
  - Deep, out-of-order speculative execution engine
      - Ensures execution units are busy
  - Enhanced branch prediction algorithm
      - Reduces mispredictions by 33% from previous versions
      - Significantly improves performance of the processor

[Figure: Advanced Dynamic Execution (Ref 6)]

Pentium 4 Micro-Architecture: NetBurst

Intel NetBurst Microarchitecture Overview
  - Designed to achieve high performance for integer and floating-point computations at high clock rates
  - Features:
      - Hyper-pipelined technology that enables high clock rates and frequency headroom (up to 10 GHz)
      - A high-performance, quad-pumped bus interface to the Intel NetBurst microarchitecture system bus
      - A rapid execution engine to reduce the latency of basic integer instructions
      - Out-of-order speculative execution to enable parallelism
      - Superscalar issue to enable parallelism

Intel NetBurst Microarchitecture Overview (cont'd)
  - Features:
      - Hardware register renaming to avoid register name space limitations
      - Cache line sizes of 64 bytes
      - Hardware pre-fetch
      - A pipeline that optimizes for the common case of frequently executed instructions
      - Employment of techniques to hide stall penalties, such as parallel execution, buffering, and speculation

[Figure: Pentium 4 Basic Block Diagram (Ref 1)]

Pentium 4 Basic Block Diagram Description
  - Four main sections:
      - The In-Order Front End
      - The Out-of-Order Execution Engine
      - The Integer and Floating-Point Execution Units
      - The Memory Subsystem

[Figure: Intel NetBurst Microarchitecture in Detail (Ref 1)]

In-Order Front End
  - Consists of:
      - The Instruction TLB/Pre-fetcher
      - The Instruction Decoder
      - The Trace Cache
      - The Microcode ROM
      - The Front-End Branch Predictor (BTB)
  - Performs the following functions:
      - Pre-fetches instructions that are likely to be executed
      - Fetches required instructions that have not been pre-fetched
      - Decodes instructions into µops
      - Generates microcode for complex instructions and special-purpose code
      - Delivers decoded instructions from the execution trace cache
      - Predicts branches (uses the past history of program execution to speculate where the program is going to execute next)

Instruction TLB, Prefetcher
  - The Instruction TLB/Pre-fetcher translates the linear instruction pointer addresses given to it into physical addresses needed to access the L2 cache, and performs page-level protection checking
  - The Intel NetBurst microarchitecture supports three pre-fetching mechanisms:
      - A hardware instruction fetcher that
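
The bandwidth figures quoted on the system bus and Advanced Transfer Cache slides (3.2 GB/s, 1.06 GB/s, and 44.8 GB/s) can be reproduced with simple arithmetic. The sketch below is only a sanity check, not part of the original slides. It assumes a 64-bit (8-byte) data path on both the P4 and PIII system buses, which the slides do not state, and it takes 1 GB as 10^9 bytes. The helper name peak_bandwidth_gb_s is hypothetical.

```python
def peak_bandwidth_gb_s(clock_hz, transfers_per_clock, bytes_per_transfer):
    """Peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    return clock_hz * transfers_per_clock * bytes_per_transfer / 1e9


# Assumption (not on the slides): both system buses are 64 bits (8 bytes) wide.
BUS_WIDTH_BYTES = 8

# P4 system bus: 100 MHz base clock, quad pumped (4 transfers per clock).
p4_bus = peak_bandwidth_gb_s(100_000_000, 4, BUS_WIDTH_BYTES)

# PIII system bus: 133 MHz, one transfer per clock.
p3_bus = peak_bandwidth_gb_s(133_000_000, 1, BUS_WIDTH_BYTES)

# Advanced Transfer Cache: 256-bit (32-byte) interface, one transfer
# per core clock, on a 1.4 GHz P4.
atc = peak_bandwidth_gb_s(1_400_000_000, 1, 256 // 8)

if __name__ == "__main__":
    print(f"P4 system bus:   {p4_bus:.2f} GB/s")
    print(f"PIII system bus: {p3_bus:.2f} GB/s")
    print(f"ATC to core:     {atc:.2f} GB/s")
```

Under these assumptions the quad-pumped bus works out to roughly three times the PIII bus bandwidth, and the on-die L2 interface to about fourteen times the P4 system bus bandwidth.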


UAH CPE 631 - Intel Pentium 4