Lecture 19: Core Design

• Today: issue queue, ILP, clock speed, ILP innovations

Wakeup Logic

[Figure: wakeup logic. Each issue-window entry holds tagL/rdyL and tagR/rdyR; broadcast tags tag1 … tagIW are compared (=) against the entry's tags and the matches are ORed.]

Selection Logic

[Figure: selection logic. Issue-window entries connect through arbiter cells with req, grant, anyreq, and enable signals.]

• For multiple FUs, will need sequential selectors

Structure Complexities

• Critical structures: register map tables, issue queue, LSQ, register file, register bypass
• Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs)
• Conflict between the desire to increase IPC and clock speed
• Can achieve both if we use large structures and deep pipelining; but some structures can’t be easily pipelined, and long-latency structures can also hurt IPC

Deep Pipelines

• What does it mean to have 2-cycle wakeup, 2-cycle bypass, 2-cycle regread?

Scaling Options

[Figure: block diagrams of 20-IQ / 40 Regs / F F F F configurations (annotated 2-cycle wakeup, 2-cycle regread, 2-cycle bypass) and 15-IQ / 30 Regs / F F F configurations, labeled Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling.]

Recent Trends

• Not much change in structure capacities
• Not much change in cycle time
• Pipeline depths have become shorter (circuit delays have reduced); this is good for energy efficiency
• Optimal performance is observed at about 50 pipeline stages (we are currently at ~20 stages for energy reasons)
• Deep pipelines improve parallelism (helps if there’s ILP); deep pipelines increase the gap between dependent instructions (hurts when there is little ILP)

ILP Limits (Wall 1993)

[Figure not recovered.]

Techniques for High ILP

• Better branch prediction and fetch (trace cache) → cascading branch predictors?
• More physical registers, ROB, issue queue, LSQ → two-level regfile/IQ?
• Higher issue width → clustering?
• Lower average cache hierarchy access time
• Memory dependence prediction
• Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading

Impact of Mem-Dep Prediction

• In the perfect model, loads only wait for conflicting stores; in the naïve model, loads issue speculatively and must be squashed if a dependence is later discovered

From Chrysos and Emer, ISCA’98

Clustering

[Figure: Reg-rename & instr steer feeding two clusters, each with IQ, Regfile, and F F.]

Original code:
r1 ← r2 + r3
r4 ← r1 + r2
r5 ← r6 + r7
r8 ← r1 + r5

Renamed code:
p21 ← p2 + p3
p22 ← p21 + p2
p42 ← p21
p41 ← p56 + p57
p43 ← p42 + p41

• 40 regs in each cluster
• r1 is mapped to p21 and p42; this will influence steering and instr commit
• On average, only 8 replicated regs

2Bc-gskew Branch Predictor

[Figure: tables BIM, G0, G1, and Meta, indexed by Address and Address+History, feeding Vote and Pred.]

• 44 KB; 2-cycle access; used in the Alpha 21464

Rules

• On a correct prediction
  • if all agree, no update
  • if they disagree, strengthen correct preds and chooser
• On a misprediction
  • update chooser and recompute the prediction
  • on a correct prediction, strengthen correct preds
  • on a misprediction, update all preds

Runahead (Mutlu et al., HPCA’03)

[Figure: Trace Cache, Current Rename, Issue Q, Regfile (128), Checkpointed Regfile (32), Retired Rename, ROB, FUs, L1 D, Runahead Cache.]

When the oldest instruction is a cache miss, behave like it causes a context-switch:
• checkpoint the committed registers, rename table, return address stack, and branch history register
• assume a bogus value and start a new thread
• this thread cannot modify program state, but can prefetch

Memory Bottlenecks

• 128-entry window, real L2: 0.77 IPC
• 128-entry window, perfect L2: 1.69
• 2048-entry window, real L2: 1.15
• 2048-entry window, perfect L2: 2.02
• 128-entry window, real L2, runahead: 0.94
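
One way to read the Memory Bottlenecks numbers is to normalize each configuration to the 128-entry, real-L2 baseline. The sketch below does that in Python. It assumes all five figures are IPC (the slide labels only the first one), and the names `ipc`, `baseline`, `speedup`, and `fraction_recovered` are illustrative, not from the lecture.

```python
# Figures copied from the Memory Bottlenecks list.
# Assumption: every value is IPC, although only the first is labeled as such.
ipc = {
    ("128-entry", "real L2"): 0.77,
    ("128-entry", "perfect L2"): 1.69,
    ("2048-entry", "real L2"): 1.15,
    ("2048-entry", "perfect L2"): 2.02,
    ("128-entry", "real L2 + runahead"): 0.94,
}

baseline = ipc[("128-entry", "real L2")]


def speedup(config):
    """Return the IPC of `config` relative to the 128-entry, real-L2 baseline."""
    return ipc[config] / baseline


for (window, l2), value in ipc.items():
    ratio = speedup((window, l2))
    print(f"{window:>10} window, {l2:<18} IPC {value:.2f}  {ratio:.2f}x")

# How much of the gain from a 2048-entry window (with a real L2)
# does runahead recover while keeping the 128-entry window?
runahead_gain = ipc[("128-entry", "real L2 + runahead")] - baseline
window_gain = ipc[("2048-entry", "real L2")] - baseline
fraction_recovered = runahead_gain / window_gain
print(f"runahead recovers {fraction_recovered:.0%} of the large-window gain")
```

Under these assumptions, runahead on a 128-entry window reaches about 1.22x the baseline IPC, against about 1.49x for the 2048-entry window, so it recovers a bit under half of the large-window gain without enlarging the window.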