Intel Pentium 4 Processor

Slide titles:
Intel Pentium 4 Processor
Outline
Introduction
IA-32
IA-32 (cont’d)
Pentium III vs. Pentium 4 Pipeline
Comparison Between Pentium3 and Pentium4
Execution on MPEG4 Benchmarks @ 1 GHz
Instruction Set Architecture
SSE2
Comparison Between SSE and SSE2
Packing
Hardware Support for SSE2
SSE2 Instructions (1)
SSE2 Instructions (2)
SSE2 Instructions (3)
Conclusion
Instruction Stream
Slide 19
Front End
Prefetch
Decoder
Trace Cache
What is a Trace Cache?
Pentium 4 Trace Cache
Microcode ROM
Branch Prediction
Slide 28
Branch Target Buffer
Return Address Stack
Branch Hints
Out-of-Order Execution
Issue
Execution
Execution Units
Double-pumped ALUs
Retirement
Execution Pipeline
Slide 39
Data Stream of Pentium 4 Processor
Register Renaming
Register Renaming (2)
On-chip Caches
L1 Instruction Cache
L1 Data Cache
L2 Cache
Data Prefetcher in L2 Cache
Store and Load
Store
Store-to-Load Forwarding
System Bus
Slide 52
What Went Wrong
No L3 cache
Small L1 Cache
Loses consistently to AMD
Northwood
Slide 58
Prescott
Slide 60

Intel Pentium 4 Processor
Presented by Michele Co
(much slide content courtesy of Zhijian Lu and Steve Kelley)

Outline
• Introduction (Zhijian)
  – Willamette (11/2000)
• Instruction Set Architecture (Zhijian)
• Instruction Stream (Steve)
• Data Stream (Zhijian)
• What went wrong (Steve)
• Pentium 4 revisions
  – Northwood (1/2002)
  – Xeon (Prestonia, ~2002)
  – Prescott (2/2004)
• Dual Core
  – Smithfield

Introduction
• Intel Pentium 4 processor
  – Latest IA-32 processor equipped with a full set of IA-32 SIMD operations
• First implementation of a new micro-architecture called “NetBurst” by Intel (11/2000)

IA-32
• Intel architecture 32-bit (IA-32)
  – 80386 instruction set (1985)
  – CISC, 32-bit addresses
• “Flat” memory model
• Registers
  – Eight 32-bit registers
  – Eight FP stack registers
  – 6 segment registers

IA-32 (cont’d)
• Addressing modes
  – Register indirect (mem[reg])
  – Base + displacement (mem[reg + const])
  – Base + scaled index (mem[reg + (2^scale x index)])
  – Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])
• SIMD instruction sets
  – MMX (Pentium MMX, Pentium II)
    » Eight 64-bit MMX registers, integer ops only
  – SSE (Streaming SIMD Extension, Pentium III)
    » Eight 128-bit registers

Pentium III vs. Pentium 4 Pipeline

Comparison Between Pentium3 and Pentium4

Execution on MPEG4 Benchmarks @ 1 GHz

Instruction Set Architecture
• Pentium4 ISA = Pentium3 ISA + SSE2 (Streaming SIMD Extensions 2)
• SSE2 is an architectural enhancement to the IA-32 architecture

SSE2
• Extends MMX and the SSE extensions with 144 new instructions:
  • 128-bit SIMD integer arithmetic operations
  • 128-bit SIMD double precision floating point operations
  • Enhanced cache and memory management operations

Comparison Between SSE and SSE2
• Both support operations on 128-bit XMM registers
• SSE only supports 4 packed single-precision floating-point values
• SSE2 supports more:
  • 2 packed double-precision floating-point values
  • 16 packed byte integers
  • 8 packed word integers
  • 4 packed doubleword integers
  • 2 packed quadword integers
  • Double quadword

Packing
128 bits (word = 2 bytes)
  Quad word   | Double word | Double word
  64 bit      | 32 bit      | 32 bit

Hardware Support for SSE2
• Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium3
• Increased bandwidth in load/store for floating-point values
  • Load and store are 128 bits wide
  • One load plus one store can be completed between XMM register and L1 cache in one clock cycle

SSE2 Instructions (1)
• Data movements
  • Move data between XMM registers and between XMM registers and memory
• Double precision floating-point operations
  • Arithmetic instructions on both scalar and packed values
• Logical instructions
  • Perform logical operations on packed double precision floating-point values

SSE2 Instructions (2)
• Compare instructions
  • Compare packed and scalar double precision floating-
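
The packed data types listed under "Comparison Between SSE and SSE2" and "Packing" are all views of the same 128 bits. The sketch below illustrates that idea in plain Python rather than with real SSE2 instructions. The names `XMM_LAYOUTS`, `view_xmm`, and `packed_add_words` are hypothetical helpers introduced only for this illustration, and integer lanes are assumed unsigned for simplicity. The lane-wise add models the wraparound behaviour of a packed word add such as PADDW; it is not a description of how the hardware performs it.

```python
import struct

# Lane layouts for one 128-bit XMM register: name -> (struct code, lane count).
# "<" below selects little-endian byte order, as used by IA-32.
# Integer lanes are treated as unsigned (an assumption made for this sketch).
XMM_LAYOUTS = {
    "packed bytes": ("B", 16),            # 16 x 8-bit
    "packed words": ("H", 8),             # 8 x 16-bit
    "packed doublewords": ("I", 4),       # 4 x 32-bit
    "packed quadwords": ("Q", 2),         # 2 x 64-bit
    "packed single-precision": ("f", 4),  # 4 x 32-bit float (SSE)
    "packed double-precision": ("d", 2),  # 2 x 64-bit float (SSE2)
}


def view_xmm(raw: bytes, layout: str) -> tuple:
    """Interpret 16 raw bytes as the lanes of the named packed type."""
    if len(raw) != 16:
        raise ValueError("an XMM register holds exactly 16 bytes (128 bits)")
    code, count = XMM_LAYOUTS[layout]
    return struct.unpack(f"<{count}{code}", raw)


def packed_add_words(a: bytes, b: bytes) -> bytes:
    """Add eight 16-bit lanes independently, wrapping each lane on overflow."""
    lanes_a = view_xmm(a, "packed words")
    lanes_b = view_xmm(b, "packed words")
    sums = ((x + y) & 0xFFFF for x, y in zip(lanes_a, lanes_b))
    return struct.pack("<8H", *sums)


if __name__ == "__main__":
    register = bytes(range(16))
    for name in XMM_LAYOUTS:
        print(f"{name:26s} {view_xmm(register, name)}")
```

Running the script prints the same 16 bytes under each layout. The lane count halves each time the lane width doubles, which is what the Packing slide depicts.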