UW-Madison COMPSCI 701 - Instruction Scheduling When Memory Latency is Uncertain

Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain

Daniel R. Kerns† and Susan J. Eggers
Department of Computer Science and Engineering
University of Washington, FR-35
Seattle, WA 98195
E-Mail: [email protected] and [email protected]

Abstract

Traditional list schedulers order instructions based on an optimistic estimate of the load delay imposed by the implementation. Therefore they cannot respond to variations in load latencies (due to cache hits or misses, congestion in the memory interconnect, etc.) and cannot easily be applied across different implementations. We have developed an alternative algorithm, known as balanced scheduling, that schedules instructions based on an estimate of the amount of instruction level parallelism in the program. Since scheduling decisions are program- rather than machine-based, balanced scheduling is unaffected by implementation changes. Since it is based on the amount of instruction level parallelism that a program can support, it can respond better to variations in load latencies. Performance improvements over a traditional list scheduler on a Fortran workload, simulating several different machine types (cache-based workstations, large parallel machines with a multipath interconnect, and a combination, all with non-blocking processors), are quite good, averaging between 3% and 18%.

This research was supported by NSF Grant No. CCR-9114167, NSF PYI Award No. MIP-9058439, ONR Grant No. N00014-92-J-1395 and the Washington Technology Center.
†Author's current address: Pure Software, 1309 S. Mary Ave, Sunnyvale, CA 94087.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ACM SIGPLAN PLDI 6/93, Albuquerque, N.M.
© 1993 ACM 0-89791-598-4/93/0006/0278...$1.50

1 Introduction

Instruction schedulers for conventional machines generate code assuming a machine model in which load latencies are well-defined and fixed. Usually the latencies reflect the most optimistic execution situation, e.g., the time of a cache hit rather than a cache miss. Compiler optimizations intended to improve performance through instruction scheduling, such as reordering instructions to avoid pipeline stalls, insert independent instructions after loads to keep the CPU busy while memory references are in progress. The number of instructions inserted (in the best case) depends on this predefined latency value.

When a load reference must exceed the implementation-defined latency, the processor architecture generally stipulates that instruction execution be stalled. The advantage of this design (called blocking loads) is that it requires a simple and straightforward hardware implementation. The consequence for compiler technology is that the compiler does not have to consider multiple memory latencies during instruction scheduling.

Two architectural innovations make it worthwhile to reconsider how to schedule behind load instructions. The first is processor designs that do not stall on unsatisfied load references (called nonblocking loads) through the use of lockup-free caches [17, 18, 15, 13], multiple hardware contexts [2, 1] or an instruction lookahead scheme [2]. Nonblocking loads allow a processor to continue executing other instructions while a load is in progress. Although the design requires more complex hardware, more instruction level parallelism can be exploited, and therefore programs execute faster.
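The fixed-latency approach described above can be sketched in a few lines. The following is a minimal, illustrative model (not taken from the paper): it computes critical-path priorities over a dependence DAG, charging every load a single machine-defined latency, and then issues the highest-priority ready instruction first so that an independent instruction fills the load's delay. The instruction names and the `FIXED_LOAD_LATENCY` value are hypothetical.

```python
# Illustrative sketch of a traditional list scheduler's priority function,
# where every load is assumed to take one fixed, optimistic latency
# (e.g., a cache hit). Node names and latencies are hypothetical.

FIXED_LOAD_LATENCY = 2  # machine-defined estimate, fixed for all programs

def priority(node, succs, memo=None):
    """Longest path (in cycles) from `node` to the end of the basic block.

    `succs` maps each instruction to the instructions that depend on it.
    A load contributes its fixed latency; every other op takes one cycle.
    """
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    latency = FIXED_LOAD_LATENCY if node.startswith("load") else 1
    followers = succs.get(node, [])
    memo[node] = latency + max((priority(s, succs, memo) for s in followers),
                               default=0)
    return memo[node]

# Tiny dependence DAG: the load feeds the add; the mul is independent.
succs = {"load1": ["add1"], "mul1": [], "add1": []}
ready = ["load1", "mul1"]

# List scheduling issues the ready instruction with the highest priority,
# so the load (priority 3) goes first and the independent mul is left
# to execute during the load's delay.
order = sorted(ready, key=lambda n: priority(n, succs), reverse=True)
```

Because the priority of each load is baked in by `FIXED_LOAD_LATENCY`, a schedule produced this way is tied to one implementation's latency estimate, which is exactly the limitation the paper targets.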
The second innovation is machines that have a large variance in memory response time. This variance may be due to congestion in a multipath interconnect or to a hierarchy of memory, including both cache hierarchies and local and global memories.

Variable load instruction latencies, coupled with nonblocking loads, complicate scheduling, because the instruction scheduler does not know how many instructions to schedule after a load to maintain high processor utilization. If the memory reference is delayed beyond the scheduler's latency estimate, the processor will stall and processor utilization will drop. However, if the load latency is shorter than the estimate, the destination register of a load instruction will be tied up longer than necessary. This may increase register pressure enough to cause unnecessary spills to memory and a consequent increase in program execution time. In addition, an excessive number of instructions may migrate to the top of the schedule, leaving an insufficient number to hide load latencies near the bottom. In this case the CPU will also be needlessly idled.

In this paper we present a code scheduling algorithm, called balanced scheduling, that has been specifically designed to tolerate a wide range of variance in load latency over the entire execution of a program. Balanced scheduling works within the context of a traditional list scheduler [9, 14, 21, 8, 6], but uses a new method for calculating load instruction weights. Rather than using weights that are determined by the implementation and therefore are fixed for all programs, the weight of each load is based on the amount of instruction level parallelism that is available to it. (We refer to this as load level parallelism.) This assignment is effective, since load instructions are scheduled for the maximum latency that can be sustained by the amount of load level parallelism in the code. In essence, our algorithm schedules for the code instead of scheduling for the machine.
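To make the weighting idea concrete, here is a rough, hypothetical sketch of assigning load weights from load level parallelism rather than from a machine latency. It is a simplification for illustration, not the paper's exact algorithm: each instruction that is independent of a load can help hide that load's latency, and when several loads compete for the same independent instruction, its hiding capacity is split evenly among them. All names are invented for the example.

```python
# Hypothetical simplification of the balanced-scheduling weighting idea:
# a load's weight grows with the number of independent instructions
# available to hide it, shared among all loads competing for them.
# This is an illustrative sketch, not the paper's published algorithm.

def balanced_load_weights(loads, independent_of):
    """Weight each load as 1 plus its share of the independent
    instructions that could be scheduled behind it.

    `independent_of` maps each load to the set of instructions with
    no dependence on it (candidates to fill its delay).
    """
    # Count, for each candidate instruction, how many loads it could cover.
    cover_count = {}
    for ld in loads:
        for ins in independent_of[ld]:
            cover_count[ins] = cover_count.get(ins, 0) + 1

    # Each instruction contributes a fractional share to every load it
    # covers, so its hiding capacity is amortized over competing loads.
    weights = {}
    for ld in loads:
        share = sum(1.0 / cover_count[ins] for ins in independent_of[ld])
        weights[ld] = 1.0 + share
    return weights

# Two loads competing for the same two independent instructions: each
# load receives half of each instruction's capacity, giving weight 2.0.
w = balanced_load_weights(
    ["load1", "load2"],
    {"load1": {"mul1", "add1"}, "load2": {"mul1", "add1"}},
)
```

Note how the weights depend only on the program's dependence structure: change the machine and the weights are unchanged, which is the program-based (rather than machine-based) property the paper emphasizes.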
Looking at it another way, balanced scheduling amortizes the cost of incorrectly estimating actual load latencies over all load instructions in the program.

To validate the algorithm we compared the performance of several programs scheduled via balanced scheduling and a traditional list scheduler on a variety of processor and memory architectures. The processor models differed in their ability to exploit load level parallelism; each was coupled with three different memory systems that exhibit dissimilar latency behavior. Both the balanced scheduler and the traditional scheduler were incorporated into the GCC [19] compiler and generated code for the Perfect Club benchmarks [4]. Performance improvements for balanced scheduling averaged 3% to 18% over the traditional list scheduler, for different processor and system models.

