POWER5
Ewen Cheslack-Postava, Case Taintor, Jake McPadden

POWER5 Lineage
- IBM 801 – widely considered the first true RISC processor
- POWER1 – 3 chips wired together (branch, integer, floating point)
- POWER2 – improved POWER1: added a 2nd FPU, more cache, and 128-bit math
- POWER3 – moved to a 64-bit architecture
- POWER4…

POWER4
- Dual-core
- High-speed connections to up to 3 other pairs of POWER4 CPUs
- Ability to turn off a pair of CPUs to increase throughput
- The Apple G5 uses a single-core derivative of POWER4 (PowerPC 970)
- POWER5 was designed to allow POWER4 optimizations to carry over

Pipeline Requirements
- Maintain binary compatibility
- Maintain structural compatibility, so optimizations for POWER4 carry forward
- Improved performance
- Enhancements for server virtualization
- Improved reliability, availability, and serviceability at chip and system levels

Pipeline Improvements
- Enhanced thread-level parallelism: two threads per processor core, a.k.a. simultaneous multithreading (SMT)
- 2 threads/core * 2 cores/chip = 4 threads/chip
- Each thread has independent access to the L2 cache
- Dynamic power management
- Reliability, availability, and serviceability

POWER5 Chip Stats
- Copper interconnects decrease wire resistance and reduce delays in wire-dominated chip timing paths
- 8 levels of metal
- 389 mm² die

POWER5 Chip Stats (cont.)
- Silicon-on-insulator (SOI) devices: a thin layer of silicon (50 nm to 100 µm) on an insulating substrate, usually sapphire or silicon dioxide (80 nm)
- SOI reduces the electrical charge a transistor has to move during a switching operation (compared to bulk CMOS):
  • increased speed (up to 15%)
  • reduced switching energy (up to 20%)
  • allows higher clock frequencies (> 5 GHz)
- SOI chips cost more to produce and are therefore used for high-end applications
- Reduces soft errors

Pipeline
- Pipeline identical to POWER4
- All latencies, including the branch misprediction penalty and the load-to-use latency with an L1 data cache hit, are the same as on POWER4

POWER5 Pipeline
- IF – instruction fetch, IC – instruction cache, BP – branch predict, Dn – decode stage, Xfer – transfer, GD – group dispatch, MP – mapping, ISS – instruction issue, RF – register file read, EX – execute, EA – compute address, DC – data cache, F6 – six-cycle floating-point unit, Fmt – data format, WB – write back, CP – group commit

Instruction Data Flow
- LSU – load/store unit, FXU – fixed-point execution unit, FPU – floating-point unit, BXU – branch execution unit, CRL – condition register logical execution unit

Instruction Fetch
- Fetch up to 8 instructions per cycle from the instruction cache
- Instruction cache and instruction translation are shared between threads
- Only one thread fetches per cycle

Branch Prediction
- Three branch history tables shared by the 2 threads: one bimodal predictor, one path-correlated predictor, and one that predicts which of the first two is correct
- Can predict all branches – even if every instruction fetched is a branch
- Branch-to-link-register (bclr) and branch-to-count-register targets are predicted using the return address stack and the count cache mechanism
- Absolute and relative branch targets are computed directly in the branch scan function
- Branches are entered in the branch information queue (BIQ) and deallocated in program order

Instruction Grouping
- Separate instruction buffers for each thread, 24 instructions per buffer
- 5 instructions are fetched from one thread's buffer to form an instruction group
- All instructions in a group are decoded in parallel

Group Dispatch & Register Renaming
- When all resources necessary for a group are available, the group is dispatched (GD)
- D0 through GD: instructions remain in program order
- MP – register renaming: architected registers are mapped to physical registers
- Register files are shared dynamically by the two threads; in ST mode all registers are available to the single thread
- Renamed instructions are placed in shared issue queues

Group Tracking
- Instructions are tracked as a group to simplify tracking logic
- Control information is placed in the global completion table (GCT) at dispatch
- Entries are allocated in program order, but the two threads' entries may be intermingled
- GCT entries are deallocated when the group is committed

Load/Store Reorder Queues
- The load reorder queue (LRQ) and store reorder queue (SRQ) maintain the program order of loads and stores within a thread
- They allow checking for address conflicts between loads and stores

Instruction Issue
- No distinction is made between instructions from different threads, and there is no priority difference between threads
- Issue is independent of the instruction's GCT group
- Up to 8 instructions can issue per cycle
- Instructions then flow through the execution units and the write-back stage

Group Commit
- Group commit (CP) happens when all instructions in the group have executed without exceptions and the group is the oldest group in its thread
- One group can commit per cycle from each thread

Enhancements to Support SMT
- Instruction and data caches are the same size as POWER4's, but associativity is doubled to 2-way and 4-way respectively
- Instruction cache and data cache entries can be fully shared between threads

Enhancements to Support SMT (cont.)
- Two-step address translation:
  • effective address → virtual address, using a 64-entry segment lookaside buffer (SLB)
  • virtual address → physical address, using the hashed page table, cached in a 1024-entry four-way set-associative TLB
- Two first-level translation tables (one for instructions, one for data); the SLB and TLB are consulted only on a first-level miss

Enhancements to Support SMT (cont.)
- First Level Data Translation Table –
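The two-step translation walk described above can be sketched as a toy model. This is a hedged illustration: the `Translator` class, the dict-based tables, and the segment/page sizes are assumptions made for the example; only the SLB/TLB entry counts and the EA → VA → PA flow come from the slides. Real hardware hashes into the page table and would trap on a miss.

```python
# Toy model of POWER5-style two-step address translation.
# Table sizes follow the slides; everything else is illustrative.

SLB_ENTRIES = 64      # segment lookaside buffer: effective -> virtual segment
TLB_ENTRIES = 1024    # four-way set-associative TLB (modeled as a plain dict)

SEG_BITS = 28         # assume 256 MiB segments for the sketch
PAGE_BITS = 12        # assume 4 KiB pages for the sketch

class Translator:
    def __init__(self):
        self.slb = {}   # effective segment id -> virtual segment id
        self.tlb = {}   # virtual page number  -> physical page number

    def map_segment(self, esid, vsid):
        self.slb[esid] = vsid

    def map_page(self, vpn, ppn):
        self.tlb[vpn] = ppn

    def translate(self, ea):
        # Step 1: effective address -> virtual address via the SLB.
        esid = ea >> SEG_BITS
        vsid = self.slb[esid]            # an SLB miss would raise (i.e., trap)
        va = (vsid << SEG_BITS) | (ea & ((1 << SEG_BITS) - 1))
        # Step 2: virtual address -> physical address via the TLB,
        # which caches entries from the hashed page table.
        vpn = va >> PAGE_BITS
        ppn = self.tlb[vpn]              # a TLB miss would walk the page table
        return (ppn << PAGE_BITS) | (va & ((1 << PAGE_BITS) - 1))
```

The two lookups mirror the slides' structure: the SLB result only rewrites the segment bits, and the page offset passes through both steps unchanged.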
View Full Document
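As a closing illustration, the three-table branch prediction scheme from the slides (one bimodal table, one path-correlated table, and a selector that predicts which of the first two is correct) can be sketched as a tournament predictor. This is a hedged sketch: the gshare-style history indexing, the table size, and the 2-bit saturating-counter update rules are assumptions for the example, not POWER5's actual tables or hashing.

```python
# Tournament predictor sketch: bimodal + path-correlated + selector.
SIZE = 1024

bimodal  = [1] * SIZE   # 2-bit saturating counters, init weakly not-taken
gshare   = [1] * SIZE   # path-correlated table (gshare-style, an assumption)
selector = [1] * SIZE   # < 2 -> trust bimodal, >= 2 -> trust path-correlated
history  = 0            # global branch history register

def predict(pc):
    b = bimodal[pc % SIZE] >= 2
    g = gshare[(pc ^ history) % SIZE] >= 2
    return g if selector[pc % SIZE] >= 2 else b

def update(pc, taken):
    global history
    b_idx, g_idx = pc % SIZE, (pc ^ history) % SIZE
    b, g = bimodal[b_idx] >= 2, gshare[g_idx] >= 2
    # Train the selector toward whichever component predictor was right.
    if b != g:
        delta = 1 if g == taken else -1
        selector[pc % SIZE] = max(0, min(3, selector[pc % SIZE] + delta))
    # Train both component predictors as 2-bit saturating counters.
    for tbl, idx in ((bimodal, b_idx), (gshare, g_idx)):
        tbl[idx] = max(0, min(3, tbl[idx] + (1 if taken else -1)))
    history = ((history << 1) | int(taken)) & (SIZE - 1)
```

A biased branch quickly saturates its bimodal counter, while branches whose outcome depends on the path taken to reach them are captured by the history-indexed table; the selector learns per-branch which table to believe.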