POWER5
Ewen Cheslack-Postava, Case Taintor, Jake McPadden

POWER5 Lineage
- IBM 801 – widely considered the first true RISC processor
- POWER1 – 3 chips wired together (branch, integer, floating point)
- POWER2 – improved POWER1: added a 2nd FPU, more cache, and 128-bit math
- POWER3 – moved to a 64-bit architecture
- POWER4…

POWER4
- Dual-core
- High-speed connections to up to 3 other pairs of POWER4 CPUs
- Ability to turn off a pair of CPUs to increase throughput
- The Apple G5 uses a single-core derivative of POWER4 (PowerPC 970)
- POWER5 was designed to allow POWER4 optimizations to carry over

Pipeline Requirements
- Maintain binary compatibility
- Maintain structural compatibility, so optimizations for POWER4 carry forward
- Improved performance
- Enhancements for server virtualization
- Improved reliability, availability, and serviceability at chip and system levels

Pipeline Improvements
- Enhanced thread-level parallelism: two threads per processor core, a.k.a. simultaneous multithreading (SMT)
- 2 threads/core * 2 cores/chip = 4 threads/chip
- Each thread has independent access to the L2 cache
- Dynamic power management
- Reliability, availability, and serviceability

POWER5 Chip Stats
- Copper interconnects decrease wire resistance and reduce delays in wire-dominated chip timing paths
- 8 levels of metal
- 389 mm² die

POWER5 Chip Stats (cont.)
- Silicon-on-insulator (SOI) devices: a thin layer of silicon (50 nm to 100 µm) on an insulating substrate, usually sapphire or silicon dioxide (80 nm)
- SOI reduces the electrical charge a transistor has to move during a switching operation (compared to bulk CMOS):
  • increased speed (up to 15%)
  • reduced switching energy (up to 20%)
  • allows higher clock frequencies (> 5 GHz)
- SOI chips cost more to produce and are therefore used for high-end applications
- Reduces soft errors

Pipeline
- Pipeline identical to POWER4
- All latencies, including the branch misprediction penalty and the load-to-use latency with an L1 data cache hit, are the same as on POWER4

POWER5 Pipeline
- IF – instruction fetch, IC – instruction cache, BP – branch predict, Dn – decode stage, Xfer – transfer, GD – group dispatch, MP – mapping, ISS – instruction issue, RF – register file read, EX – execute, EA – compute address, DC – data cache, F6 – six-cycle floating-point unit, Fmt – data format, WB – write back, CP – group commit

Instruction Data Flow
- LSU – load/store unit, FXU – fixed-point execution unit, FPU – floating-point unit, BXU – branch execution unit, CRL – condition register logical execution unit

Instruction Fetch
- Fetch up to 8 instructions per cycle from the instruction cache
- Instruction cache and instruction translation are shared between threads
- Only one thread fetches per cycle

Branch Prediction
- Three branch history tables shared by the 2 threads: one bimodal predictor, one path-correlated predictor, and one that predicts which of the first two is correct
- Can predict all branches – even if every instruction fetched is a branch
- Branch-to-link-register (bclr) and branch-to-count-register targets are predicted using the return address stack and the count cache mechanism
- Absolute and relative branch targets are computed directly in the branch scan function
- Branches are entered in the branch information queue (BIQ) and deallocated in program order

Instruction Grouping
- Separate instruction buffers for each thread, 24 instructions per buffer
- 5 instructions are fetched from one thread's buffer to form an instruction group
- All instructions in a group are decoded in parallel

Group Dispatch & Register Renaming
- When all resources necessary for a group are available, the group is dispatched (GD)
- D0 through GD: instructions remain in program order
- MP – register renaming: architected registers are mapped to physical registers
- Register files are shared dynamically by the two threads; in ST mode all registers are available to the single thread
- Renamed instructions are placed in shared issue queues

Group Tracking
- Instructions are tracked as a group to simplify tracking logic
- Control information is placed in the global completion table (GCT) at dispatch
- Entries are allocated in program order, but the two threads' entries may be intermingled
- GCT entries are deallocated when the group is committed

Load/Store Reorder Queues
- The load reorder queue (LRQ) and store reorder queue (SRQ) maintain the program order of loads and stores within a thread
- They allow checking for address conflicts between loads and stores

Instruction Issue
- No distinction is made between instructions from different threads, and there is no priority difference between threads
- Issue is independent of the instruction's GCT group
- Up to 8 instructions can issue per cycle
- Instructions then flow through the execution units and the write-back stage

Group Commit
- Group commit (CP) happens when all instructions in the group have executed without exceptions and the group is the oldest group in its thread
- One group can commit per cycle from each thread

Enhancements to Support SMT
- Instruction and data caches are the same size as POWER4's, but associativity is doubled to 2-way and 4-way respectively
- Instruction cache and data cache entries can be fully shared between threads

Enhancements to Support SMT (cont.)
- Two-step address translation:
  • effective address → virtual address, using a 64-entry segment lookaside buffer (SLB)
  • virtual address → physical address, using the hashed page table, cached in a 1024-entry four-way set-associative TLB
- Two first-level translation tables (one for instructions, one for data); the SLB and TLB are consulted only on a first-level miss

Enhancements to Support SMT (cont.)
- First Level Data Translation Table –
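The two-step translation walk described above can be sketched as a toy model. This is a hedged illustration: the `Translator` class, the dict-based tables, and the segment/page sizes are assumptions made for the example; only the SLB/TLB entry counts and the EA → VA → PA flow come from the slides. Real hardware hashes into the page table and would trap on a miss.

```python
# Toy model of POWER5-style two-step address translation.
# Table sizes follow the slides; everything else is illustrative.

SLB_ENTRIES = 64      # segment lookaside buffer: effective -> virtual segment
TLB_ENTRIES = 1024    # four-way set-associative TLB (modeled as a plain dict)

SEG_BITS = 28         # assume 256 MiB segments for the sketch
PAGE_BITS = 12        # assume 4 KiB pages for the sketch

class Translator:
    def __init__(self):
        self.slb = {}   # effective segment id -> virtual segment id
        self.tlb = {}   # virtual page number  -> physical page number

    def map_segment(self, esid, vsid):
        self.slb[esid] = vsid

    def map_page(self, vpn, ppn):
        self.tlb[vpn] = ppn

    def translate(self, ea):
        # Step 1: effective address -> virtual address via the SLB.
        esid = ea >> SEG_BITS
        vsid = self.slb[esid]            # an SLB miss would raise (i.e., trap)
        va = (vsid << SEG_BITS) | (ea & ((1 << SEG_BITS) - 1))
        # Step 2: virtual address -> physical address via the TLB,
        # which caches entries from the hashed page table.
        vpn = va >> PAGE_BITS
        ppn = self.tlb[vpn]              # a TLB miss would walk the page table
        return (ppn << PAGE_BITS) | (va & ((1 << PAGE_BITS) - 1))
```

The two lookups mirror the slides' structure: the SLB result only rewrites the segment bits, and the page offset passes through both steps unchanged.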
View Full Document
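As a closing illustration, the three-table branch prediction scheme from the slides (one bimodal table, one path-correlated table, and a selector that predicts which of the first two is correct) can be sketched as a tournament predictor. This is a hedged sketch: the gshare-style history indexing, the table size, and the 2-bit saturating-counter update rules are assumptions for the example, not POWER5's actual tables or hashing.

```python
# Tournament predictor sketch: bimodal + path-correlated + selector.
SIZE = 1024

bimodal  = [1] * SIZE   # 2-bit saturating counters, init weakly not-taken
gshare   = [1] * SIZE   # path-correlated table (gshare-style, an assumption)
selector = [1] * SIZE   # < 2 -> trust bimodal, >= 2 -> trust path-correlated
history  = 0            # global branch history register

def predict(pc):
    b = bimodal[pc % SIZE] >= 2
    g = gshare[(pc ^ history) % SIZE] >= 2
    return g if selector[pc % SIZE] >= 2 else b

def update(pc, taken):
    global history
    b_idx, g_idx = pc % SIZE, (pc ^ history) % SIZE
    b, g = bimodal[b_idx] >= 2, gshare[g_idx] >= 2
    # Train the selector toward whichever component predictor was right.
    if b != g:
        delta = 1 if g == taken else -1
        selector[pc % SIZE] = max(0, min(3, selector[pc % SIZE] + delta))
    # Train both component predictors as 2-bit saturating counters.
    for tbl, idx in ((bimodal, b_idx), (gshare, g_idx)):
        tbl[idx] = max(0, min(3, tbl[idx] + (1 if taken else -1)))
    history = ((history << 1) | int(taken)) & (SIZE - 1)
```

A biased branch quickly saturates its bimodal counter, while branches whose outcome depends on the path taken to reach them are captured by the history-indexed table; the selector learns per-branch which table to believe.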