Pentium Pro Case Study
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
ECE/CS 752 Advanced Computer Architecture I
Lecture notes based on notes by John P. Shen
Updated by Mikko Lipasti

Pentium Pro Case Study
- Microarchitecture
  - Order-3 superscalar
  - Out-of-order execution
  - Speculative execution
  - In-order completion
- Design methodology
- Performance analysis
- Retrospective

Goals of P6 Microarchitecture
- IA-32 compliant
- Performance (frequency, IPC)
- Validation
- Die size
- Power

P6: The Big Picture
[Figure: P6 block diagram. Labels: BTB, ICU, BAC, Rename/Allocation; Fetch (4 cycles), Decode (2 cycles), Dispatch (2 cycles); Reservation Station (20), Schedule (2 cycles), ports 0-4; IEU0, IEU1, JEU, Fadd, Fmul, Imul, Div, AGU0, AGU1; ROB (40x157), RRF, MOB, DCU.]

Memory Hierarchy
- Level 1 instruction and data caches: 2-cycle access time
- Level 2 unified cache: 6-cycle access time
- Separate level 2 cache and memory address/data bus
[Figure: CPU with ICache (8 KB) and DCache (8 KB), L2 Cache (256 KB), BIU, 64-bit bus, PCI, Main Memory.]

Instruction Fetch
[Figure: instruction fetch unit. Labels: Fetch Address, Next Addr Logic, Instruction TLB, Physical Addr, ICache (8 KB), Victim Cache, Stream Buffer, Other Fetch Requests, Instruction Data Mux, Inst Length Decoder, Length Marks, Prediction Marks, Inst Rotator, Inst Buf, Branch Target, Prediction; 16-byte paths (16 bytes + marks) to Decode.]

Instruction Cache Unit
[Figure: instruction cache unit. Labels: Fetch Address (upper 20 bits, lower 12 bits), ITLB, Instruction Tag Array, Tag Compare, Hit/Miss, State Machine, ICache (8 KB), Victim Cache, Stream Buffer, Bus Interface Unit, Instruction Data, Data Mux, 16 bytes, To Next Address Calc.]

Branch Target Buffer
- Pattern History Table (PHT) is not speculatively updated
- A speculative Branch History Register (BHR) and prediction state is maintained
- Uses speculative prediction state if it exists for that branch
[Figure: Branch Target Buffer (512), 2 cycle. Labels: Fetch Address, 128 sets, Way 0, Way 1, Way 3; entry fields Fetch Addr Tag, Target Addr, Br Offset, 4-bit BHR, 4-bit BHR (spec); PHT, 16 entries/set; Br Pred, Prediction Control Logic, Return Stack, Branch Execution, Br History, Prediction, Target Addr.]

Branch Prediction Algorithm
- Current prediction updates the speculative history prior to the next instance of the branch instruction
- Branch History Register (BHR) is updated during branch execution
- Branch misprediction resets the speculative branch history state to match BHR
- Branch recovery flushes front end and drains the execution core
[Figure: Br History (0010) and Speculative History (0101) indexing a Pattern Table with entries 0000 through 1111; outputs Prediction and Spec Pred.]

Instruction Decode (1)
- Branch instruction detection
- Branch address calculation
- Static prediction and branch-always execution
- One branch decode per cycle (break on branch)
[Figure: Macro-instruction bytes from IFU, Instruction Buffer (16 bytes), To Next Address Calc, uROM, Decoder 0 (4 uops), Decoder 1 (1 uop), Decoder 2 (1 uop), Branch Address Calc, uop Queue (6); up to 3 uops issued to dispatch.]

Instruction Decode (2)
- Instruction buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is refilled
- Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0
- What is a uop?
  - Small two-operand instruction
  - Very RISC-like
  - IA-32 instruction:
      add (eax), (ebx)        MEM(eax) <- MEM(eax) + MEM(ebx)
  - Uop decomposition:
      ld  guop0, (eax)        guop0 <- MEM(eax)
      ld  guop1, (ebx)        guop1 <- MEM(ebx)
      add guop0, guop1        guop0 <- guop0 + guop1
      sta eax
      std guop0               MEM(eax) <- guop0
[Figure: same decoder diagram as Instruction Decode (1).]

Instruction Dispatch and Renaming
[Figure: uop Queue (6), Dispatch Buffer (3), Mux Logic, To Reservation Station (2 cycles); Allocator; Integer RAT (EAX, EBX, ECX, ...); Floating Point RAT (FST0, FST1, FST2, ..., FST7); Reorder Buffer (ROB) entries 0-39; Real Register File (RRF); Retirement Info; uop fields GuoP0, GuoP1, IuoP(0-3), CC, Events.]
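The branch prediction notes above describe a 4-bit BHR that selects a pattern table entry, a PHT that is never updated speculatively, and a separate speculative history that is reset to the BHR on a misprediction. The following is a minimal sketch of that bookkeeping for a single branch, not the actual P6 logic. The class and method names are hypothetical, and the 2-bit saturating counter (with its initial value) is an assumption standing in for the prediction state machine, which the notes do not spell out.

```python
class TwoLevelPredictor:
    """One branch's history registers plus a 16-entry pattern table (sketch)."""

    HISTORY_BITS = 4  # 4-bit BHR, as in the notes

    def __init__(self):
        self.mask = (1 << self.HISTORY_BITS) - 1
        self.bhr = 0       # committed history, updated at branch execution
        self.spec_bhr = 0  # speculative history, updated at prediction time
        # Assumption: 2-bit saturating counters, initialised weakly not-taken.
        self.pht = [1] * (1 << self.HISTORY_BITS)

    def predict(self):
        """Predict using the speculative history, then shift the prediction into it."""
        taken = self.pht[self.spec_bhr] >= 2
        self.spec_bhr = ((self.spec_bhr << 1) | int(taken)) & self.mask
        return taken

    def resolve(self, predicted, actual):
        """Called at branch execution: the only place the PHT and BHR change."""
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, counter + 1) if actual else max(0, counter - 1)
        self.bhr = ((self.bhr << 1) | int(actual)) & self.mask
        if predicted != actual:
            # Misprediction: speculative history is reset to match the BHR.
            self.spec_bhr = self.bhr


def run(predictor, outcomes):
    """Predict and immediately resolve each outcome; return per-branch hit flags."""
    results = []
    for actual in outcomes:
        predicted = predictor.predict()
        predictor.resolve(predicted, actual)
        results.append(predicted == actual)
    return results


if __name__ == "__main__":
    hits = run(TwoLevelPredictor(), [True, False] * 10)
    print(hits.count(False), "mispredictions while the pattern table warms up")
```

Because `predict` only touches the speculative history, several predictions for the same branch can be in flight before any of them resolves, which is the situation the speculative BHR exists to handle.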
Allocation requirements ("3 or none")
- Reorder buffer entries
- Reservation station entry
- Load buffer or store buffer entry
Dispatch buffer probably dispatches all 3 uops before refill.

Register Renaming
- Similar to Tomasulo's Algorithm: uses ROB entry number as tags
- Execution results are stored in the ROB
- The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register

Register Renaming Example
[Figure: Integer RAT (EAX, EBX, ECX), Floating Point RAT (FST0-FST7), Real Register File (RRF), and Reorder Buffer (ROB) entries 0-39, with a completed (Comp) "sub" entry and an Alloc pointer. Completing: sub eax, ecx. Dispatching: add eax, ebx; add eax, ecx; fxch f0, f1.]

Challenges to Register Renaming
- Byte-addressable registers
- 8-bit code:
    mov AL, data1
    mov AH, data2
    add AL, data3
    add AL, data4

Out-of-Order Execution Engine
- In-order branch issue and execution
- In-order load/store issue to address generation units
- Instruction execution and result bus scheduling
- Is the reservation station truly centralized, and what is binding?
[Figure: From Dispatch Queue to Reservation Station (20) with RS bypass; Port 0 through Port 4; IEU0, IEU1, JEU, Fadd, Fmul, Imul, Div, AGU0, AGU1, MOB, DCU (8 KB).]

Reservation Station
- Cycle 1: order checking, operand availability
- Cycle 2: writeback bus scheduling
[Figure: Reservation Station (2 cycles), Port 0 through Port 4, To Execution Units.]

Memory Ordering Buffer (MOB)
- Load buffer retains loads until completed, for coherency checking
- Store forwarding out of store buffers
- 2-cycle latency
[Figure: AGU0 and AGU1 feeding Load Buffer (16), Store Address Buffer (12), Store Data Buffer (12), and Conflict Logic; R.S.; Data Cache Unit (8 KB, 2 cycle); Load Data Result (2 cycle).]

Instruction Completion
- Handles all exception/interrupt/trap conditions
- Handles branch recovery
  - OOO core drains out right-path instructions, commits to RRF
  - In parallel, front end starts fetching from target/fall-through
  - However, no renaming is allowed until OOO core is drained
  - After draining is done, RAT is reset to point to RRF
  - Avoids checkpointing RAT and recovering to intermediate RAT state
- Commits execution results to the architectural state in order
  - Retirement Register File (RRF)
  - Must handle hazards to RRF (writes/reads in same cycle)
  - Must handle hazards to RAT (writes/reads in same cycle)
[Figure: ROB, RRF, MOB, Bypass Logic Control.]
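The renaming and completion notes above fit together as one small state machine: the RAT points each register at the ROB entry holding its newest value, results wait in the ROB, retirement copies them to the RRF in program order, and branch recovery resets the RAT to point at the RRF. The sketch below models only that bookkeeping under simplifying assumptions. The class name `P6RenameSketch`, its methods, and the dictionary-based entries are hypothetical, and the model ignores the 3-wide allocation rule, partial registers, and same-cycle RAT/RRF hazards.

```python
class P6RenameSketch:
    """Toy RAT + ROB + RRF model using ROB entry numbers as tags."""

    def __init__(self, rob_size=40):
        self.rrf = {"eax": 0, "ebx": 0, "ecx": 0}  # architectural state
        self.rat = {}                # register -> ROB entry with its newest value
        self.rob = [None] * rob_size
        self.head = 0                # oldest entry, next to retire
        self.tail = 0                # next entry to allocate
        self.count = 0

    def rename(self, dest, srcs):
        """Allocate a ROB entry for one uop; return (tag, renamed sources)."""
        if self.count == len(self.rob):
            raise RuntimeError("ROB full: allocation must stall")
        # A source reads the ROB if a younger value is in flight, else the RRF.
        renamed = [("rob", self.rat[s]) if s in self.rat else ("rrf", s) for s in srcs]
        tag = self.tail
        self.rob[tag] = {"dest": dest, "value": None, "done": False}
        self.rat[dest] = tag
        self.tail = (self.tail + 1) % len(self.rob)
        self.count += 1
        return tag, renamed

    def complete(self, tag, value):
        """Execution result is written into the ROB, possibly out of order."""
        self.rob[tag]["value"] = value
        self.rob[tag]["done"] = True

    def retire(self):
        """Commit finished uops from the head of the ROB, in program order."""
        retired = []
        while self.count and self.rob[self.head]["done"]:
            entry = self.rob[self.head]
            self.rrf[entry["dest"]] = entry["value"]
            if self.rat.get(entry["dest"]) == self.head:
                del self.rat[entry["dest"]]  # newest value now lives in the RRF
            self.rob[self.head] = None
            retired.append(self.head)
            self.head = (self.head + 1) % len(self.rob)
            self.count -= 1
        return retired

    def recover(self):
        """Branch recovery, assuming right-path uops have already retired.

        Whatever is still in the ROB is treated as wrong-path and discarded,
        and every register is pointed back at the RRF.
        """
        self.rob = [None] * len(self.rob)
        self.head = self.tail = self.count = 0
        self.rat.clear()


if __name__ == "__main__":
    core = P6RenameSketch()
    t0, s0 = core.rename("eax", ["eax", "ebx"])  # add eax, ebx
    t1, s1 = core.rename("eax", ["eax", "ecx"])  # add eax, ecx
    print(s0, s1)
    core.complete(t1, 7)   # the younger uop finishes first
    print(core.retire())   # nothing retires until the head entry is done
    core.complete(t0, 3)
    print(core.retire(), core.rrf["eax"])
```

Resetting the whole RAT in `recover` is only safe because the core has drained first, which is the trade the notes describe: a slower recovery in exchange for not checkpointing intermediate RAT states.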


UW-Madison ECE/CS 752 - Pentium Pro Case Study
