The AMD Hammer Processor Core

Chetana N. Keltcher
Member of Technical Staff
Advanced Micro Devices

Aug 2002, AMD Hammer Processor Core, HotChips 14


Hammer Architecture Overview

• First x86-64 based processor
• Aggressive out-of-order, 9-issue superscalar processor
• Integrated DDR memory controller
• Leading performance in integer, floating point and multimedia
  – x86-64, x87, MMX™, 3DNow!™, SSE, SSE2

[Diagram: Hammer Architecture. Hammer processor core with L1 instruction cache, L1 data cache, L2 cache, DDR memory controller and HyperTransport™ technology.]


Hammer Core Overview

[Block diagram: L1 Icache (64KB), Fetch, Branch Prediction, Scan/Align, Fastpath, Microcode Engine, Int Decode & Rename, FP Decode & Rename, µOPs, Instruction Control Unit (72 entries), three integer pipes (each Res, AGU, ALU; one MULT), 36-entry FP scheduler (FADD, FMUL, FMISC), 44-entry Load/Store Queue, L1 Dcache (64KB), L2 Cache, System Request Queue, Crossbar, Memory Controller, HyperTransport™.]


Instruction Fetch

• Supply 16 instruction bytes to the decoder per cycle
• 64KB instruction cache, 2-way set associative
  – Linearly-indexed, physically-tagged, 64-byte block size
  – Prefetch next sequential block on a miss
• 2 sets of instruction cache tags (fetch port, snoop)
• Predecode instruction
  – 1 end bit per byte
  – Decode some branch types
• Branch prediction

[Block diagram as in Hammer Core Overview, with the fetch path highlighted; the L2 cache is labeled 256KB – 1M.]


Branch Prediction

• Sequential Fetch
• Predicted Fetch
• Branch Target Address Calculator Fetch
• Mispredicted Fetch
• 5-10% improvement in prediction accuracy vs. AMD Athlon™

[Diagram: L2 Cache and Branch Selectors (with evicted data), Global History Counter (16k, 2-bit counters), Target Array (2k targets), 12-entry Return Address Stack (RAS), Branch Target Address Calculator (BTAC), execution stages. Pipeline stages shown: Pick, DEC1, DEC2, Pack, EDEC, Dispatch, Issue, Execute, Redirect.]


Scan / Align

• Convert x86 instructions to fixed-length µOPs
• Dispatch 3 µOPs per cycle to integer/FP schedulers
• Instructions use one of two decoding pipelines
  – Fastpath: instructions decoding to two or fewer µOPs are decoded by hardware, packed into 3 dispatch positions
  – Microcode: x86 instructions decoding to more than two µOPs, calculate ROM entry point, fetch sequence from ROM
• Compared to AMD Athlon™, more instructions use the fastpath
  – E.g., packed SSE is microcoded in AMD Athlon and fastpath in Hammer
  – Hammer has 8% fewer microcoded instructions for SPECint2000
  – Hammer has 28% fewer microcoded instructions for SPECfp2000


Execution Units

• 3 integer units
• 3 address generation units
• 3 superscalar floating point units
• Integer
  – Full 64-bit data path
  – 3 x 8-entry reservation stations
  – Single cycle 32 and 64-bit add, sub, rotate, shift, logical, etc.
  – 32-bit multiply: 3 cycle latency
  – 64-bit multiply: 5 cycle latency
• Floating point
  – Handles x87, MMX™, 3DNow!™, SSE and SSE2
  – 36-entry scheduler
  – Out-of-order, fully pipelined design


Load/Store and Data Cache

• 64KB data cache
  – 2-way set associative
  – Linearly-indexed, physically-tagged
  – 40-bit physical address
  – 48-bit linear address
  – MOESI coherency
  – 64-byte block size
• Banked and dual ported
  – 2 64-bit reads/writes each cycle to different banks
• 3 sets of data cache tags (port A, port B, snoop)
• Load->use latency is 3 cycles (zero segment base)
  – 1 extra cycle to handle misaligned (quadword boundary) loads
• Data forwarding from stores to dependent loads
• Hardware prefetch


L2 Cache

• Configurable sizes up to 1MB
• 16-way set associative
• L1 and L2 storage is mutually exclusive
• Pseudo-LRU scheme to reduce the number of LRU bits by half
• Stores IC predecode and branch prediction bits
• 10 outstanding miss requests
  – 8 DC
  – 2 IC
• System interface
  – Victim Buffer (8-entry)
  – Snoop Buffer (8-entry)
  – Write Buffer (4-entry)


TLB for Large Workloads

[Diagram, recoverable contents listed below.]

• L1 Instruction TLB: 40 entry, fully associative, 4M/2M & 4k pages
• L2 Instruction TLB: 512-entry, 4-way associative, 4k pages
• L1 Data TLB: 40 entry, fully associative, 4M/2M & 4k pages
• L2 Data TLB: 512-entry, 4-way associative, 4k pages
• 24-entry Page Descriptor Cache (PML4, PDP, PDE), with PDC reload
• Flush Filter CAM, 32 entry (CR3, PDP, PDE; snoop modify)
• Table walk and TLB reload through the L2 cache
• ASN
• Linear address fields (bit ranges):

  Field    Bits
  Sign     63-48
  PML4     47-39
  PDP      38-30
  PDE      29-21
  PTE      20-12
  Offset   11-0


Integrated Memory Controller

• Integrated DDR memory controller
  – 8-byte or 16-byte interface
  – Unbuffered or Registered DIMMs
  – 16-byte interface supports direct connection to 8 registered DIMMs and chipkill ECC
  – Significantly reduces memory latency
  – Memory latency improves as CPU and HyperTransport™ link speed improves
  – Performance improves by approximately 20% compared to AMD Athlon™ topology
  – Snoop throughput scales with CPU frequency
• Integrated Northbridge Functionality
  – Processes requests from CPU/IO to DRAM/IO
  – HyperTransport™ routing
    • peak bandwidth = 6.4GB/s
  – Handles transaction ordering and cache coherence
  – Runs at the same frequency as CPU core

[Diagram: System Request Queue, Crossbar, HyperTransport™ links.]
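

Supplementary sketch: linear address fields

The TLB slide lists the bit ranges of a linear address (Sign 63-48, PML4 47-39, PDP 38-30, PDE 29-21, PTE 20-12, Offset 11-0). A minimal Python sketch of that split for a 4k page is given here as an illustration only. The names `split_linear_address` and `is_canonical` are hypothetical and not from the slides; the canonical-form check relies on the general x86-64 rule that the Sign bits are copies of bit 47.

```python
# Field layout taken from the "TLB for Large Workloads" slide:
# (name, lowest bit, width in bits). Applies to the 4k-page case.
FIELDS = [
    ("pml4", 39, 9),
    ("pdp", 30, 9),
    ("pde", 21, 9),
    ("pte", 12, 9),
    ("offset", 0, 12),
]


def split_linear_address(addr):
    """Split a 64-bit linear address into its table-walk indices.

    Hypothetical helper for illustration; not AMD code.
    """
    parts = {name: (addr >> low) & ((1 << width) - 1)
             for name, low, width in FIELDS}
    parts["sign"] = (addr >> 48) & 0xFFFF  # bits 63-48
    return parts


def is_canonical(addr):
    """True when bits 63-48 all equal bit 47 (x86-64 canonical form)."""
    sign = (addr >> 48) & 0xFFFF
    bit47 = (addr >> 47) & 1
    return sign == (0xFFFF if bit47 else 0x0000)


if __name__ == "__main__":
    for a in (0x00007FFFFFFFF000, 0xFFFF800000000000, 0x0000800000000000):
        print(hex(a), split_linear_address(a), is_canonical(a))
```

Each 9-bit field selects one of 512 entries at its level of the page-table walk, which is why the Page Descriptor Cache on the slide can skip levels by caching PML4, PDP and PDE entries.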
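

Supplementary sketch: L1 cache geometry

The instruction and data cache slides give 64KB, 2-way set associative, 64-byte blocks, linearly-indexed. The arithmetic that follows from those three numbers can be sketched as below. The function names are hypothetical, and the sketch assumes a plain power-of-two index with no hashing, which the slides do not confirm.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    sets = size_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1  # log2(block size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


def set_index(linear_addr, size_bytes=64 * 1024, ways=2, block_bytes=64):
    """Set selected by a linear address (assumed simple bit-slice index)."""
    sets, offset_bits, _ = cache_geometry(size_bytes, ways, block_bytes)
    return (linear_addr >> offset_bits) & (sets - 1)


if __name__ == "__main__":
    sets, off, idx = cache_geometry(64 * 1024, 2, 64)
    print(f"sets={sets}, offset bits={off}, index bits={idx}")
    print("index uses address bits", off, "to", off + idx - 1)
```

With the slide's parameters this gives 512 sets and an index drawn from address bits 6 to 14. Bits 12 to 14 lie above the 4k page offset, which is consistent with the cache being described as linearly-indexed and physically-tagged.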
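

Supplementary sketch: 2-bit saturating counters

The branch prediction slide mentions a global history counter table of 16k 2-bit counters but does not say how it is indexed. The sketch below shows how a table of 2-bit saturating counters behaves in general. The indexing (XOR of the branch address with a global history register, in the style of gshare) is an assumption swapped in for illustration, not the Hammer design, and the class name is hypothetical.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters (values 0..3).

    0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self, entries=16 * 1024):
        self.mask = entries - 1        # entries must be a power of two
        self.counters = [1] * entries  # start at weakly not taken
        self.history = 0               # global taken/not-taken history

    def _index(self, pc):
        # ASSUMPTION: gshare-style index; the slides do not specify this.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    table = TwoBitCounterTable()
    pc = 0x400
    for _ in range(40):      # a branch that is always taken
        table.update(pc, True)
    print(table.predict(pc))
```

The two-bit width means a single atypical outcome moves a strongly biased counter only one step, so the prediction does not flip on one misprediction.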