Macro instruction synthesis for embedded processors
Pinhong Chen
Yunjian Jiang (william)
- CS252 project presentation

Motivation
• Start from a simple processor core
• Find new macro instructions to enhance performance and reduce code size
• Application-specific
• Using dedicated hardware to speed up
[Figure: block diagram. Labels: I/D Mem., ALU, Reg, Bus unit, control, Application, Macro Instr. Ext., Control, Reg/Mem Access]

RISC8 Architecture
Why RISC8?
• Simple
• 8-bit ISA with 43 instructions
• Addressable space 64K bytes
• Complete ISA, including Load/Store, Arithmetic, Logical, Branch, Multiplication, Division, Stack Operation, Subroutine call, Interrupt Operations, etc.
• Small Verilog core: 3.5K gates in 0.25um
• Clock speed of 300MHz is reported (our result is about 200MHz)
• Synthesizable RTL core
• Free assembler

Methodology
[Figure: tool flow diagram. Labels: Application (*.c), Front end, IR (exp. tree), Code Gen., Asm. code, Assembler, mach. code, simulation, performance, Instr. Profiling, RTL exp. tree, Instr. Syn (three occurrences)]

Different Levels of expression trees
Example statement: sum += c & 5
[Figure: three expression trees for the statement, with node labels in reading order.
  SUIF IR: ASSIGN, ADD, VAR, AND, VAR, VAR, CON
  RTL IR after code gen: ASSIGN, ADD, byte, con08, reg, acc, acc, reg, AND, acc, reg, byte, byte
  Reconstructed from mach. code: ASSIGN, ADD, VAR, con08, AND, addr16, MOV, VAR, addr16]

Expression trees
SUIF IR
• Data type carried
• Inaccurate cost
• No profiling
• Simple: fewer tree nodes
• Machine independent
Register level
• Data type carried
• One-to-one between macro instructions
• Profiling data can be back-annotated
• Machine dependent
Machine code
• Data type lost
• One-to-one between machine instructions
• Profiling data accurate
• Large expression trees
• Machine dependent

Instruction Enumeration
• Traverse tree structure in post-order
• Normalize sub-tree orders
• Combine patterns from sub-trees
• Hash new instruction patterns
• Collect register usage and memory access for evaluation
• Annotate profiling information
[Figure: expression tree. Node labels: ADD, byte, con08, acc, reg, AND, acc, reg, byte]

Machine Code Level Tree Reconstruction
• Build IR tree from machine codes
• Recover data dependencies from assembly code
• Clear definition by ISA, e.g. AND r2 ==> acc = acc & r2
• Limited to a basic block
• Eliminate intermediate storage nodes
[Figure: expression tree before elimination. Node labels: ADD, byte, con08, acc, reg, AND, acc, reg, byte]
[Figure: expression tree after elimination. Node labels: ADD, byte, con08, AND, byte; two regions marked "Special Instr."]

Table-Driven Assembly Development Tools
[Figure: tool flow diagram. Labels: Asm. code, mach. code, Assembler, performance, Instr. Profile, Disassembler, Simulator, New Instr. Select, Instr. Table, New Instruction Candidates, Asm. code, Instr. Syn]

Table-driven back-end tool automation

    @new_ins = (
      'mac' => {
        # expression tree implemented by the instruction: r0 = r0 + Rn * [addr16]
        otree   => ['r0', 'nADD', 'r0', ['nMUL', 'Rn', 'addr16']],
        # operand syntax in assembly
        pattern => 'Rn addr16',
        # encoding: op-code byte, register byte, address low byte, address high byte
        code    => ['00000011', '00000$Rn', '$addr16[0]', '$addr16[1]'],
        # simulator semantics
        sim     => '$R[0]+=$R[$Rn]*$memory[$addr16]',
        cycles  => 13,
        # simulator operand decoding
        decode  => '$Rn=$memory[$pc++] & 0x7; $addr16[0]=$memory[$pc++]; $addr16[1]=$memory[$pc++]; $addr16=$addr16[0]|($addr16[1]<<8);',
      },
    );

Op-Code Reuse
• Op-codes may not be fully used in a specific application
• Remove unused instruction op-codes
• Typical applications use far fewer than 256 op-codes
• Cost of op-code reuse
• Decoding logic
• Less flexibility

    application   FIR   ADPCM   GSM   max7219   LCD4x20   PRN-IO
    Opcodes       28    49      32    39        40        30

Implementation
• Compiler front-end: SUIF
• Code generator: SPAM-olive, retargeted to RISC8
• RTL pattern enumeration: C++
• RISC8 assembler: PERL
• RISC8 simulator: PERL
• Machine level pattern enumeration: PERL
• Macro driven instruction implementation automation: PERL

Benchmarks
Columns: Benchmark, Instructions, #

adpcm
    null:nASSIGN(word,nAND(areg,const16))
    null:nASSIGN(word,nADD(areg,word))
    bool:nBOOL(areg,const16)
    bool:nBOOL(nAND(areg,const16),areg)
    areg:nIOR(nAND(areg,const16),word)
    #: 4040863624 [column values run together]

GSM-encoder
    acc:nAND(acc,const08)
    acc:nAND(nASR(acc,const08),reg)
    acc:nIOR(nAND(acc,const08),reg)
    acc:nASR(acc,const08)
    acc:nIOR(nAND(nASR(acc,const08),const08),reg)
    #: 796492414330621 [column values run together]

PRN-IO
    acc:nIOR(acc,const08)
    null:nASSIGN(byte,nIOR(acc,reg))
    null:nASSIGN(byte,nIOR(acc,const08))
    bool:nBOOL(nAND(areg,const16),const16)
    #: 240969660 [column values run together]

LCD_4X20
    bool:nBOOL(acc,const08)
    null:nASSIGN(byte,nADD(reg,one))
    #: 9930 [column values run together]

max7219
    acc:nIOR(acc,const08)
    bool:nBOOL(nAND(acc,reg),zero)
    #: 14048 [column values run together]

GSM encoder
• Hardware/software tradeoff
• Software gain: execution speed, code size
• Hardware cost: functional unit, decoding logic, data path configuration
[Figure: chart of code-size, cycle, and hardware for base, instr#1, instr#2, instr#3, instr#4; axis ticks from 0 to 4000 in steps of 500]

Conclusions
• RTL level pattern enumeration
• Key to automating instruction identification, code-generation, assembly and simulation
• No need to change algorithm source code
• Hardware/software trade-off
• Good estimation of performance gain and hardware cost at register-transfer level
• Op-code
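
The steps listed on the Instruction Enumeration slide (post-order traversal, sub-tree order normalization, combining sub-tree patterns, hashing, profiling annotation) can be illustrated with a minimal Python sketch. This is not the authors' C++ or PERL implementation. The names `Node`, `leaf`, `COMMUTATIVE`, `normalize`, `enumerate_patterns` and the `weight` parameter are assumptions made for illustration, as is the choice of which operators are treated as commutative. Pattern strings follow the `type:op(args)` notation used on the Benchmarks slide.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Assumption: these operators are commutative, so operand order can be normalized.
COMMUTATIVE = {"nADD", "nAND", "nIOR"}


@dataclass(frozen=True)
class Node:
    op: str            # operator name; empty string for a leaf
    rtype: str         # result or operand type, e.g. "acc", "reg", "const08"
    kids: tuple = ()   # child nodes


def leaf(rtype):
    """Build an operand leaf of the given type."""
    return Node("", rtype)


def normalize(op, args):
    """Canonical operand order: nested patterns first, then leaves alphabetically."""
    if op in COMMUTATIVE:
        return sorted(args, key=lambda a: ("(" not in a, a))
    return list(args)


def enumerate_patterns(node, counts, weight=1):
    """Return every pattern rooted at `node`; add each to `counts` with `weight`.

    Children are visited before their parent (post-order). Each child is either
    cut to its type name or expanded into one of its own patterns.
    """
    if not node.kids:
        return []
    options = []
    for kid in node.kids:
        sub_patterns = enumerate_patterns(kid, counts, weight)
        options.append([kid.rtype] + sub_patterns)
    found = []
    for combo in product(*options):
        args = normalize(node.op, combo)
        found.append(f"{node.op}({','.join(args)})")
    found = list(dict.fromkeys(found))  # drop duplicates created by normalization
    for pattern in found:
        # The Counter is the hash table; weight stands in for a profile count.
        counts[f"{node.rtype}:{pattern}"] += weight
    return found


# Example tree, with operands deliberately given in non-canonical order.
asr = Node("nASR", "acc", (leaf("acc"), leaf("const08")))
and_ = Node("nAND", "acc", (leaf("const08"), asr))
ior = Node("nIOR", "acc", (leaf("reg"), and_))

counts = Counter()
enumerate_patterns(ior, counts, weight=10)  # weight: assumed execution count
for pattern, total in sorted(counts.items()):
    print(total, pattern)
```

Using a plain string as the hash key keeps the sketch short; a real enumerator would also have to collect register usage and memory accesses per pattern, which is omitted here.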
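
The table-driven idea behind the `mac` entry, where one table record drives both the assembler and the simulator, can be sketched in Python as follows. This is a hypothetical analogue, not the authors' PERL tooling: `NEW_INS`, `mac_sim`, `assemble`, `run`, the `length` field and the layout of `state` are all assumptions introduced for illustration. The op-code byte, the 3-bit register field, the low-then-high address byte order, the semantics and the 13-cycle cost are taken from the slide's table entry.

```python
def mac_sim(state, rn, addr16):
    """Semantics from the slide: R[0] += R[Rn] * memory[addr16].

    No 8-bit wrap-around is modeled, matching the slide's sim string.
    """
    state["R"][0] += state["R"][rn] * state["memory"][addr16]


# Hypothetical instruction table: each entry serves assembler and simulator.
NEW_INS = {
    "mac": {
        "opcode": 0b00000011,
        "cycles": 13,
        "length": 3,  # operand bytes that follow the op-code byte
        # operands -> operand bytes (register, address low byte, address high byte)
        "encode": lambda rn, addr16: [rn & 0x7, addr16 & 0xFF, (addr16 >> 8) & 0xFF],
        # operand bytes -> operands
        "decode": lambda b: (b[0] & 0x7, b[1] | (b[2] << 8)),
        "sim": mac_sim,
    },
}


def assemble(mnemonic, *operands):
    """Look the mnemonic up in the table and emit its machine code."""
    entry = NEW_INS[mnemonic]
    return bytes([entry["opcode"]] + entry["encode"](*operands))


def run(code, state):
    """Load `code` at address 0 and execute it using only table lookups."""
    memory = state["memory"]
    memory[0:len(code)] = code
    by_opcode = {entry["opcode"]: entry for entry in NEW_INS.values()}
    pc = 0
    while pc < len(code):
        entry = by_opcode[memory[pc]]
        operand_bytes = memory[pc + 1:pc + 1 + entry["length"]]
        entry["sim"](state, *entry["decode"](operand_bytes))
        state["cycles"] += entry["cycles"]
        pc += 1 + entry["length"]


# Usage: assemble "mac r2, 0x1234" and simulate it.
state = {"R": [0] * 8, "memory": bytearray(0x10000), "cycles": 0}
state["R"][0] = 1
state["R"][2] = 7
state["memory"][0x1234] = 5
code = assemble("mac", 2, 0x1234)
run(code, state)
print(code.hex(), state["R"][0], state["cycles"])
```

Adding a new candidate instruction then means adding one table entry; neither `assemble` nor `run` needs to change, which is the property the table-driven flow relies on.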