Lecture 12
Architectures for Low Power: Transmeta's Crusoe Processor

Motivation
- Exponential performance increase at a low cost
- However, for some application areas low power consumption is more important than performance:
  - Mobile communications
  - Mobile computing
  - Wireless Internet
  - Medical implants
  - Deep space applications
- Battery lifetime

Designing for Low Power: Approaches
- Trading area/performance for power
  - Power can be reduced by decreasing the supply voltage and allowing the performance to degrade (trading performance for power).
  - But these techniques incur an area penalty (trading area for power).
- Avoiding waste
  - Clocking modules when they are idle
  - Glitching
  - Using dedicated rather than programmable hardware
  - Reducing control overhead by using regular algorithms and architectures
  - Designing systems to meet performance requirements
- Exploiting locality
  - Global operations inherently consume a lot of power.
  - Data must be transferred from one part of the chip to another at the expense of switching large bus capacitances.
  - A design partitioned to exploit locality of reference can minimize expensive global communication in favor of much less costly local interconnect networks.

Crusoe Family of Processors from Transmeta
- Introduction
- Software: Code Morphing
- Hardware: VLIW core
- Performance
- Applications

The Idea – David Ditzel
- Champion of simple chip architecture.
- 1995: Chief Technical Officer of Sun Microsystems Inc.'s SPARC business.
- Working on emulation of x86 software on SPARC processors.
- Early 1995: left Sun and worked on his own idea.
- Was not happy with the complexity of recent architectures.
- Some new ideas mixed with some old ideas to build a simple and fast architecture capable of running x86 code.
- Software-hardware hybrid.

The Company - Transmeta
- Ditzel and Colin Hunter chose the name Transmeta, and the company was formed in the summer of 1995.
- Used contacts in the industry to recruit top brains for the ideas.
- Design started in the living rooms of the founders' homes.
- Now employs many people.

Innovation
- Transmeta Crusoe chip
  - x86 emulation
  - Very Long Instruction Word (VLIW)
  - Code Morphing
  - Simple architecture
  - LongRun Technology
  - Virtual devices
  - Low power

Introducing a Software Layer

Software: Code Morphing
- Performs dynamic binary translation.
- Compiles instructions from one instruction set architecture (ISA) to another ISA.

Code Morphing
[Figure: x86 binary code -> Code Morphing Software -> VLIW binary code]

Decoding and Scheduling
- Code morphing translates an entire group of x86 instructions at once and stores the translation in a translation cache for future reference.
- Conventional x86 superscalar processors fetch binary instructions and decode them into separate micro-operations.
  The micro-operations are then reordered by the hardware and executed in parallel.
- The translation step introduces many opportunities:
  - Because code is repeated at high rates, translations are frequently reused from the translation cache, which reduces overhead.
  - Can use much more sophisticated scheduling algorithms.
  - Much lower power consumption because translation is all in software.
  - Can optimize generated code and, by "learning" which parts are executed often, can change levels of optimization dynamically.

Instruction Set Emulation
- Emulation is traditionally slow because of the way different ISAs handle condition codes and exceptions.
- Crusoe uses specific registers to emulate the setting of condition codes by the processor (a .c suffix after the instruction shows that condition codes need to be set).
- Exceptions are handled by using shadow registers and a procedure called "commit and rollback".

Translation by Code Morphing Software

Translation Step 1

Original x86 code:
    Addl %eax, (%esp)
    Addl %ebx, (%esp)
    Movl %esi, (%ebp)
    Subl %ecx, 5

Native VLIW code:
    Ld %r30, [%esp]
    Add.c %eax, %eax, %r30
    Ld %r31, [%esp]
    Add.c %ebx, %ebx, %r31
    Ld %esi, [%ebp]
    Sub.c %ecx, %ecx, 5

Translation Step 2
Optimization: elimination of redundant atoms and unnecessary condition-code operations.

Native VLIW code:
    Ld %r30, [%esp]
    Add.c %eax, %eax, %r30
    Ld %r31, [%esp]
    Add.c %ebx, %ebx, %r31
    Ld %esi, [%ebp]
    Sub.c %ecx, %ecx, 5

Optimized native VLIW code:
    Ld %r30, [%esp]
    Add %eax, %eax, %r30
    Add %ebx, %ebx, %r30
    Ld %esi, [%ebp]
    Sub.c %ecx, %ecx, 5

Translation Step 3
Scheduling: the remaining atoms are grouped into molecules using a large window.

Optimized native VLIW code:
    Ld %r30, [%esp]
    Add %eax, %eax, %r30
    Add %ebx, %ebx, %r30
    Ld %esi, [%ebp]
    Sub.c %ecx, %ecx, 5

Scheduled native VLIW code:
    1. Ld %r30, [%esp]; Sub.c %ecx, %ecx, 5
    2. Ld %esi, [%ebp]; Add %eax, %eax, %r30; Add %ebx, %ebx, %r30

Software's Edge
- Molecules explicitly encode the instruction-level parallelism, so they can be executed by a simple VLIW engine.
  - The hardware doesn't need to perform complex instruction reordering.
  - Simplicity means a fast, low-power design.
- Processor upgrades are simplified.
  - The software layer means that software developers don't have to recompile programs.
  - A new hardware architecture only needs new Code Morphing software from Transmeta.
- Code Morphing software can be upgraded independently into flash ROM.
- The software layer helps the debugging process.
  - There are different ways to perform the same function, so the software can be changed during debugging.
- The software layer increases performance.
  - Timing of critical paths is improved.
  - Optimization is applied to remove unnecessary instructions.
  - Reordering in software can be done much better than in hardware, by looking at a bigger window of instructions and applying more complicated algorithms.
- Several ISAs: instruction sets can be mixed with ease because they are all emulated by the software.

Hardware

Chip Simplifications
- No superscalar decode, grouping or issue logic.
- No register renaming or segmentation hardware.
- No floating-point stack hardware.
- No front-end memory management.
- Less interlock and bypassing logic.

Hardware Specifications
- 128-bit high-performance VLIW engine
  - 2 integer units (ALUs)
  - Floating-point unit
  - Memory unit
  - Branch unit

Code Morphing Hardware Support
- Handling exceptions by shadowing.
- Commit and rollback.
- Gated store buffer.
- Aliasing hardware.
- Protection for self-modifying code.
- LongRun Technology.

TM5400
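
The "commit and rollback" and "gated store buffer" items under Code Morphing Hardware Support can be illustrated with a small software model. This is a toy sketch, not Transmeta's implementation: it assumes a working copy and a shadow copy of the registers, with stores held in a buffer until the next commit. The class name `ShadowedMachine` and its methods are hypothetical names chosen for this example.

```python
class ShadowedMachine:
    """Toy model (assumption): working registers, shadow registers,
    and a gated store buffer that holds stores until commit."""

    def __init__(self, registers, memory):
        self.working = dict(registers)  # state the translated code reads and writes
        self.shadow = dict(registers)   # last committed register state
        self.memory = dict(memory)      # committed memory contents
        self.store_buffer = []          # pending (address, value) stores

    def write_reg(self, name, value):
        # Register writes only touch the working copy.
        self.working[name] = value

    def store(self, address, value):
        # Stores are gated: they do not reach memory until commit().
        self.store_buffer.append((address, value))

    def commit(self):
        # The translated block finished without an exception:
        # release pending stores and make the working state official.
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()
        self.shadow = dict(self.working)

    def rollback(self):
        # An exception occurred: drop pending stores and restore
        # the registers to the last committed state.
        self.store_buffer.clear()
        self.working = dict(self.shadow)


machine = ShadowedMachine({"eax": 1, "ecx": 10}, {0x100: 7})

# A block that faults part-way through: nothing it did survives.
machine.write_reg("eax", 2)
machine.store(0x100, 99)
machine.rollback()
state_after_rollback = (machine.working["eax"], machine.memory[0x100])

# A block that completes: its effects become visible at commit.
machine.write_reg("ecx", 5)
machine.store(0x104, 42)
machine.commit()
state_after_commit = (machine.shadow["ecx"], machine.memory[0x104])
```

After a rollback the software can re-execute the original x86 instructions one at a time, more conservatively, to find the exact instruction that raised the exception.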
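
The scheduling step in Translation Step 3 (packing atoms into molecules) can be sketched as a simple greedy scheduler. This is a minimal illustration, not the actual Code Morphing algorithm: it assumes each molecule can hold at most two integer atoms and one memory atom (from the Hardware Specifications slide), treats every register hazard conservatively, and ignores memory ordering because the example contains no stores. The names `Atom`, `conflicts` and `schedule` are hypothetical.

```python
from dataclasses import dataclass

# Issue limits per molecule (assumption based on the Hardware Specifications
# slide: 2 integer units, 1 memory unit). Other unit types are left out.
UNITS_PER_MOLECULE = {"alu": 2, "mem": 1}


@dataclass
class Atom:
    text: str    # assembly text, for display only
    unit: str    # "alu" or "mem"
    reads: set   # register names read
    writes: set  # register names written ("cc" = condition codes)


def conflicts(earlier, later):
    """True if `later` must be issued in a later molecule than `earlier`."""
    return bool(
        earlier.writes & later.reads      # read after write
        or earlier.writes & later.writes  # write after write
        or earlier.reads & later.writes   # write after read (conservative)
    )


def schedule(atoms):
    """Place each atom in the earliest molecule that respects
    register dependences and has a free functional unit."""
    molecules = []  # each molecule is a list of atoms issued together
    placement = []  # (atom, index of the molecule it was placed in)
    for atom in atoms:
        earliest = 0
        for earlier, index in placement:
            if conflicts(earlier, atom):
                earliest = max(earliest, index + 1)
        slot = earliest
        while True:
            if slot == len(molecules):
                molecules.append([])
            used = sum(1 for other in molecules[slot] if other.unit == atom.unit)
            if used < UNITS_PER_MOLECULE[atom.unit]:
                break
            slot += 1
        molecules[slot].append(atom)
        placement.append((atom, slot))
    return molecules


# The optimized native code from Translation Step 2.
optimized = [
    Atom("ld %r30, [%esp]", "mem", {"esp"}, {"r30"}),
    Atom("add %eax, %eax, %r30", "alu", {"eax", "r30"}, {"eax"}),
    Atom("add %ebx, %ebx, %r30", "alu", {"ebx", "r30"}, {"ebx"}),
    Atom("ld %esi, [%ebp]", "mem", {"ebp"}, {"esi"}),
    Atom("sub.c %ecx, %ecx, 5", "alu", {"ecx"}, {"ecx", "cc"}),
]

molecules = schedule(optimized)
for number, molecule in enumerate(molecules, start=1):
    print(number, "; ".join(atom.text for atom in molecule))
```

Under these assumptions the five atoms fit into two molecules with the same grouping as the scheduled code in Step 3: the first load and the subtract issue together, and the second load shares a molecule with the two adds that depend on the first load.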