Issue Logic and Power/Performance Tradeoffs
Edwin Olson
Andrew Menard
December 5, 2000

The need for low-power architectures
Low performance: PIMs
High performance: video decoding/MP3 playback
And increasingly, both.
  – How do you design an architecture that can do both?

A couple alternatives
High performance processor that can be lobotomized
  – Modify issue logic
  – Change structure sizes
Two separate cores
  – A high performance/high-power core
  – A low performance/low-power core

Other power throttling mechanisms
Voltage scaling
  – Huge power savings
  – There's a limit, and high performance designs are already pushing towards low voltage, which doesn't leave much room for throttling.
Burn & Coast
  – Compute at full speed, and then go into a sleep mode.
  – Simple linear power/performance throttling.

Methodology
SimpleScalar/Wattch
  – Widely used but little/no verification. Several power models available, but very large margins of error.
  – Still, the size of structures is correlated to power consumption.
Industry survey
  – Look at real-world processors with the range of characteristics of interest.
SPECint95
  – Substantially reduced input sets to make simulation feasible.

Issue Window Scaling
Popular idea: it's a highly active chip structure.
Window responsible for 20% of non-clock power (Alpha 21264 & Wattch agree)
Does it work?
  – Let's look at RUU usage
What's an upper bound on the useful size?
How do smaller sizes impact performance and power?

RUU size upper bounds
Modified SimpleScalar, let RUU be arbitrarily big.
[Figure: 4-issue. Fraction of cycles vs. RUU occupancy (0 to 64) for li, perl, compress, m88ksim.]
[Figure: 8-issue. Fraction of cycles vs. RUU occupancy (0 to 64) for li, perl, compress, m88ksim.]

Effect of bounded RUU size
The RUU's occupancy "saturates" as one would expect.
[Figure: RUU usage for li. Cycles vs. RUU occupancy (0 to 32), 16-entry RUU compared with unlimited RUU.]

Effect of Bounded RUU Size
[Figure: m88ksim on 4-issue. Fraction of cycles vs. RUU size, for RUU capacities 4, 8, 16, 32, 64.]
[Figure: m88ksim on 8-issue. Fraction of cycles vs. RUU size, for RUU capacities 8, 16, 32, 64.]

Bounded RUU Impact on Performance
Performance rapidly approaches maximum.
8-issue needs a slightly larger RUU, as expected.
[Figure: IPC vs. RUU size for 4-issue. IPC vs. RUU capacity (0 to 64) for li, perl, compress, m88ksim.]
[Figure: IPC vs. RUU size for 8-issue. IPC vs. RUU capacity (0 to 64) for li, perl, compress, m88ksim.]

Bounded RUU impact on Power
RUU power consumption increases as its size increases.
[Figure: Power consumption breakdowns for 4-issue on li. Power (W) per configuration (4x4, 4x8, 4x16, 4x32, 4x64), broken down by clock, resultbus, alu, dcache2, dcache, icache, regfile, lsq, window, bpred, rename.]

Power/Performance
There's a minimum! And it's pretty much where maximum performance is. Hmmm.

Structure                8x8    8x16   8x32   8x64
Energy/Inst (li)         13.8   12.5   13.4   14.9
Energy/Inst (perl)       15.1   14.7   15.8   17.6
Energy/Inst (compress)   12.4   11.4   11.9   13.3
Energy/Inst (m88ksim)    13.0   12.1   12.9   14.4

Analysis
Some groups have advocated a variable 16-32 capacity RUU.
Even if scaling is perfect, there's little to be gained.
A power-conscious architect is likely to be cornered into just one reasonable RUU size.

Adding a separate core
If we can't lobotomize, perhaps we can add a completely separate CPU.
Sounds like a good idea
  – Intuition: a simple in-order processor should have lower energy/instruction than a complex out-of-order one.
  – Small area overhead, around 1 mm^2.
Opportunity for more energy savings
  – Smaller register file
  – No issue window
  – Separate low-power caches (though this increases area)

Methodology
SimpleScalar/Wattch is all but useless
  – Only one parameterizable power model (Wattch) is available, and we don't know what trade-offs the designer made.
  – Wattch doesn't support sim-inorder
  – E.g., the Cacti cache model uses 10x greater energy than Krste.

Industry Survey
PowerPC Statistics
PPC440 is 2-issue, out-of-order
PPC405 is single-issue, in-order
Both use the same technology
The 440 is twice as fast, but uses only 1.66 times the power!

AM5x86 vs. K6
5x86 is in-order
K6 is out-of-order, 6-issue, 24-entry window
K6 has slightly better power/performance
  – But it's on a newer process (0.25 um rather than 0.35 um)

Crusoe's Voltage Scaling & Coast and Burn

Big Proviso
CPUs available today, even the "low power" ones, are still after speed.
  – Low power IA32 is just a slower, high-power IA32.
If you designed your simple core for super-low power (with very little regard for speed), how might this change?

Conclusion
Smaller issue windows are not a win on power; they lower the amount of ILP found by too much.
Multiple cores are not a win on power; the faster core tends to be more energy
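
The energy comparisons behind these conclusions can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the original slides. It copies the energy-per-instruction values from the Power/Performance slide (the slides give no unit, so they are treated as relative numbers) and the PowerPC speed and power ratios from the industry survey. The function names are hypothetical, the "NxM" labels are read as issue width by RUU size, and the second calculation assumes both chips run the same fixed task at full speed, so that energy is simply power times run time.

```python
# Energy per instruction for the 8-issue configurations, copied from the
# Power/Performance slide. "8x16" is read here as issue width x RUU size
# (an assumption), and the slides give no unit for these values.
ENERGY_PER_INST = {
    "li":       {"8x8": 13.8, "8x16": 12.5, "8x32": 13.4, "8x64": 14.9},
    "perl":     {"8x8": 15.1, "8x16": 14.7, "8x32": 15.8, "8x64": 17.6},
    "compress": {"8x8": 12.4, "8x16": 11.4, "8x32": 11.9, "8x64": 13.3},
    "m88ksim":  {"8x8": 13.0, "8x16": 12.1, "8x32": 12.9, "8x64": 14.4},
}


def lowest_energy_config(values):
    """Return the configuration name with the smallest energy per instruction."""
    return min(values, key=values.get)


def relative_energy_per_task(speedup, power_ratio):
    """Energy of the faster chip relative to the slower one for a fixed task.

    Energy = power * time, and the time for a fixed task scales as 1/speedup,
    so the energy ratio is power_ratio / speedup.
    """
    return power_ratio / speedup


if __name__ == "__main__":
    for bench, values in ENERGY_PER_INST.items():
        print(bench, lowest_energy_config(values))
    # PPC440 vs. PPC405: twice as fast at 1.66 times the power.
    print(relative_energy_per_task(speedup=2.0, power_ratio=1.66))
```

With the slide's numbers, every benchmark has its minimum at the 8x16 configuration, and under the stated assumption the PPC440 uses about 0.83 times the energy of the PPC405 for the same work.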