Edwin Olson, Andrew Menard
December 5, 2000

• Low performance - PIMs
• High performance - video decoding/MP3 playback
• And increasingly, both.
  – How do you design an architecture that can do both?

• High performance processor that can be lobotomized
  – Modify issue logic
  – Change structure sizes
• Two separate cores
  – A high performance/high-power core
  – A low performance/low-power core
• Voltage scaling
  – Huge power savings
  – There’s a limit, and high performance designs are pushing towards low voltage, which doesn’t leave much room for throttling.
• Burn & Coast
  – Compute at full speed, and then go into a sleep mode.
  – Simple linear power/performance throttling.

• SimpleScalar/Wattch
  – Widely used but little/no verification. Several power models available, but very large margins of error.
  – Still, the size of structures is correlated to power consumption.
• Industry survey
  – Look at real-world processors with the range of characteristics of interest.
• SPECint95
  – Substantially reduced input sets to make simulation feasible.

• Popular idea: it’s a highly active chip structure. Window responsible for 20% of non-clock power (Alpha 21264 & Wattch agree)
• Does it work?
  – Let’s look at RUU usage
• What’s an upper bound on the useful size?
• How do smaller sizes impact performance and power?

• Modified SimpleScalar, let RUU be arbitrarily big.

[Figure: 4-issue. x-axis: RUU Occupancy; y-axis: Fraction of Cycles; series: li, perl, compress, m88ksim]
[Figure: 8-issue. x-axis: RUU Occupancy; y-axis: Fraction of Cycles; series: li, perl, compress, m88ksim]

• The RUU’s occupancy “saturates” as one would expect.
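The occupancy measurement above (run with an unbounded RUU, then ask how large an RUU is ever useful) can be sketched as a cumulative distribution over per-cycle occupancy samples. This is a minimal illustration, not the authors' tooling: it assumes the plots show cumulative fractions of cycles, and the function names and the trace values are hypothetical.

```python
from collections import Counter


def occupancy_cdf(samples, max_size):
    """Return a list where entry n is the fraction of cycles with occupancy <= n."""
    counts = Counter(samples)
    total = len(samples)
    cdf = []
    running = 0
    for n in range(max_size + 1):
        running += counts.get(n, 0)
        cdf.append(running / total)
    return cdf


def smallest_size_covering(cdf, fraction):
    """Smallest occupancy n whose cumulative fraction reaches `fraction`, else None."""
    for n, value in enumerate(cdf):
        if value >= fraction:
            return n
    return None


# Hypothetical per-cycle RUU occupancy trace (made-up numbers, not simulator output).
trace = [4, 8, 8, 12, 16, 16, 16, 20, 24, 40]
cdf = occupancy_cdf(trace, max_size=64)
print(smallest_size_covering(cdf, 0.9))
```

The point where the cumulative curve flattens gives a rough upper bound on the useful RUU size: entries beyond it are occupied in almost no cycles.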
[Figure: RUU Usage - li. x-axis: RUU Occupancy; y-axis: Cycles; series: 16 Entry RUU, Unlimited RUU]
[Figure: m88ksim on 4-issue. x-axis: RUU Size; y-axis: Fraction of cycles; series: 4, 8, 16, 32, 64]
[Figure: m88ksim on 8-issue. x-axis: RUU Size; y-axis: Fraction of cycles; series: 8, 16, 32, 64]

• Performance rapidly approaches maximum.
• 8-issue needs a slightly larger RUU, as expected.

[Figure: IPC vs RUU size for 4-issue. x-axis: RUU Capacity; series: li, perl, compress, m88ksim]
[Figure: IPC vs RUU size for 8-issue. x-axis: RUU Capacity; series: li, perl, compress, m88ksim]

• Power consumption increased in RUU as size increases

[Figure: Power Consumption Breakdowns for 4 issue on li. x-axis: Configuration (4x4, 4x8, 4x16, 4x32, 4x64); y-axis: Power (W); components: clock, resultbus, alu, dcache2, dcache, icache, regfile, lsq, window, bpred, rename]

• There’s a minimum! And it’s pretty much where maximum performance is. Hmmm.

Structure                 8x8    8x16   8x32   8x64
Energy/Inst (li)          13.8   12.5   13.4   14.9
Energy/Inst (perl)        15.1   14.7   15.8   17.6
Energy/Inst (compress)    12.4   11.4   11.9   13.3
Energy/Inst (m88ksim)     13.0   12.1   12.9   14.4

• Some groups have advocated a variable 16-32 capacity RUU.
  Even if scaling is perfect, there’s little to be gained.
• A power-conscious architect is likely to be cornered into just one reasonable RUU size.

• If we can’t lobotomize, perhaps we can add a completely separate CPU.
• Sounds like a good idea
  – Intuition: a simple in-order processor should have lower energy/instruction than a complex out-of-order one.
  – Small area overhead, around 1 mm^2.
• Opportunity for more energy savings
  – Smaller register file
  – No issue window
  – Separate low-power caches (though this increases area)

• SimpleScalar/Wattch is all but useless
  – Only one parameterizable power model (Wattch) is available, and we don’t know what trade-offs the designer made.
  – Wattch doesn’t support sim-inorder
  – E.g., Cacti cache model uses 10x greater energy than Krste’s.
• Industry Survey

• PPC440 is 2-issue, out of order
• PPC405 is single issue, in-order
• Both use same technology
• The 440 is twice as fast, but uses only 1.66 times the power!

• 5x86 is in-order
• K6 is out-of-order, 6 issue, 24 entry window
• K6 has slightly better power/performance
  – But it’s on a newer process (0.25um rather than 0.35)

• CPUs available today, even the “low power” ones, are still after speed.
  – Low power IA32 is just a slower, high-power IA32.
• If you designed your simple core for super-low power (with very little regard for speed), how might this change?

• Smaller issue windows are not a win on power; they lower the amount of ILP found by too much.
• Multiple cores are not a win on power; the faster core tends to be more energy
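The arithmetic behind the survey numbers can be made explicit. Energy is power times time, so a core that is 2x as fast at 1.66x the power spends 1.66 / 2 = 0.83 of the energy on the same work. The sketch below works that ratio through and also picks the lowest energy-per-instruction configuration from the 8-issue table. It assumes the "8xN" column labels mean 8-issue with an N-entry RUU; the function and variable names are illustrative only.

```python
def relative_energy_per_task(speed_ratio, power_ratio):
    """Energy per unit of work for core A relative to core B.

    Energy = power * time and time scales as 1 / speed, so the ratio is
    power_ratio / speed_ratio. A value below 1.0 means core A uses less energy.
    """
    return power_ratio / speed_ratio


# Figures quoted in the survey: the PPC440 is 2x as fast as the PPC405
# and draws 1.66x the power.
ppc440_vs_405 = relative_energy_per_task(speed_ratio=2.0, power_ratio=1.66)
print(f"PPC440 energy per task relative to PPC405: {ppc440_vs_405:.2f}")

# Energy/Inst values copied from the 8-issue table (units as given in the slides).
# Assumption: keys are RUU sizes taken from the "8xN" column labels.
energy_per_inst = {
    "li": {8: 13.8, 16: 12.5, 32: 13.4, 64: 14.9},
    "perl": {8: 15.1, 16: 14.7, 32: 15.8, 64: 17.6},
    "compress": {8: 12.4, 16: 11.4, 32: 11.9, 64: 13.3},
    "m88ksim": {8: 13.0, 16: 12.1, 32: 12.9, 64: 14.4},
}

# For each benchmark, the RUU size with the lowest energy per instruction.
best_ruu = {bench: min(sizes, key=sizes.get) for bench, sizes in energy_per_inst.items()}
print(best_ruu)
```

Both results point the same way as the conclusions: the faster PPC440 comes out ahead on energy per task, and every benchmark in the table has its energy minimum at the same single RUU size.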