Unformatted text preview:

Hardware Loop BufferingScott DiPasqualeKhaled ElmeleegyC.J. GanierErik SwansonProblem and Hypothesisn Compilers have more information aboutloops that they can exploit.n Hardware loop unrolling is highlycomplex.n Buffering loops reduces instructioncount and improves performance withmoderate complexity.Loop Definitionsn Dynamic: The number of loop iterationsis determined within the loop itself.n Semi-dynamic: The number of loopiterations is determined prior toentering the loop; the specific value isunknown to the compiler.n Static: Fixed iteration count for a loop;the value is known to the compiler.Loop Definitionsn Well-formed: There are no controltransfers in the loop other than the exitbranch.n Ill-formed: Anything not well-formed.n We are concerned with static and semi-dynamic loops that are well formed.MotivationPercentage of loops in benchmark thatloop buffering can affectNormalized Loop Percentages00.20.40.60.811.2art bzip2 equakeBenchmarkLoop PercentageTotal LoopsStatic Well-Formed LoopsSemi-Dynamic Well-FormedTotal CoverageMotivationPercentage of total execution time spentin relevant loopsPercent Execution Time0%10%20%30%40%50%60%art bzip2 equakeBenchmarks% Execution TimePercent Execution TimeISA changes – LOOP & ENDLn LOOP Instruction takes immediate orregister with number of iterations.n Loop repeats without branch overheadn ENDL Instruction signals end of loopExample loopsLoop:SLL r3, r2, 2ADD r3, r3, r5LW r1, 0(r3)ADD r1, r1, r4SW r1, 0(r3)ADDI r2, r2, 1SGT r5, r2, r6BEQ r0, r5, LoopLOOP r6, regLoop:SLL r3, r2, 2ADD r3, r3, r5LW r1, 0(r3)ADD r1, r1, r4SW r1, 0(r3)ADDI r2, r2, 4ENDLHardware Implementationn 64b x 1024 entry buffern Loop counter, loop instruction index,loop instruction maximum, number oftimes to duplicate an instruction.n If loop bound is in a register, maximumnumber of iterations is 2^32; animmediate allows 1024 iterations.Pipeline Block DiagramLoop CacheInstructionFetchIssueMUXLoop IssueLoop InformationRegisterModeModeModePCLoop PCwhen inUnrollingModeBuffer Block DiagramLoopInstructionLoop InstructionIndex816243240.....................8184+Program CounterLoopBase PCLoop Cache(0)Loop CacheLoop InstructionMax=+ 8MUX0Loop Counter-1=0?Stop LoopExecutionExperimentationn Simple Scalar modified to bypassinstruction fetch and to add instructionsn A small, but relevant, benchmark(183.Equake) was chosen.n Only one benchmark was used due tothe necessity of analyzing andmodifying assembly by hand.Resultsn On average, individual loops in Equakesaw a speedup of 1.15.n Using Amdahl’s law, this equates to anoverall speedup of 1.062 for Equake.n Other benchmarks have similar loopcharacteristics, and are expected tohave similar speedups.Conclusionn Speedups achieved are appreciable.n Hardware complexity is moderate.n Future work can build on the frameworkdeveloped here to include suchelements as loop


View Full Document
Download Hardware Loop Buffering
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Hardware Loop Buffering and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Hardware Loop Buffering 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?