A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by Mohammed Javeed Zaki Wei Li Srinivasan Parthasarathy Computer Science Department University of Rochester June 1997 Presenter Jacqueline Ewell Introduction With the rapid advances in computational power speed cost memory and high speed network technologies a Network of Workstations provides an attractive scalable alternative compared to custom parallel machines Scalable distributed shared memory architectures rely on the ability for each processor to maintain a load balance Load balancing involves assigning each processor a proportional amount of work bases on its performance thereby minimizing total execution time of the program There are two types of load balancing static and dynamic Static load balancing allows the programmer to delegate responsibility to each processor before run time The simplest approach is the static block scheduling scheme which assigns equal blocks of iterations to each of the processors Another scheme of static load balancing is the static interleaved scheme which assigns iteration in a cyclic fashion This makes programming much easier and can solve many problems inherit in load balancing caused by processor heterogeneity and non uniform loops Static scheduling avoids run time scheduling overheads However due to transient external loads applied by multiple users on a network of workstations heterogeneity with processors different speeds memory varying amounts of available memory networks varying cost of communication among pairs of processors and software programs may have varying amount of work in each iteration a dynamic approach to load balancing is necessary Dynamic load balancing allows during run time the ability to delegate work based on run time performance of the networked set of workstations keeping in mind the trade off of task switching and loadimbalance costs The idea of Dynamic vs Static load balance is not new However there have been many load balancing schemes developed each with specific applications in mind under varying programs and system parameters Therefore wouldn t it be nice to customize dynamic load balancing dependant on applications to yield the best performance possible The paper entitled Customized Dynamic Load Balancing for a Network of Workstations identifies this problematic situation and presents a hybrid compile time and runtime modeling and decision process which selects the best scheme based on run time feedback from the processors and its performance Dynamic Load Balancing DLB Strategies When the execution time of loop iterations is not predictable at compile time runtime dynamic scheduling should be used with the additional runtime cost of managing task allocation There have been a number of models developed for dynamic loop scheduling The task queue model has been targeted towards shared memory machines while the diffusion model has been used for distributed memory machines 4 5 6 The task queue model assumes a central task queue where if a processor finishes its assigned work more work can be obtained from the queue This approach can be self guided or a master may remove the work and allocate it the processor In the diffusion model all work is delegated to each processor and when an imbalance is detected between a processor and its neighbor work movement occurs A third scheduling strategy may be used and is the bases of the paper in discussion The approach uses past performance information to predict the future performance Among these approaches have been the Global Distributed scheme the Global Central scheme the Local Distributed scheme the Global Centralized scheme using automatic generation of parallel programs the Local Distributed receiver initiated scheme and the Local Centralized scheme In all cases the load balancing involves periodic exchange of performance information The method discussed in the paper implements instead of periodic exchange an interrupt base receiver Its goal is to select among the given schemes the best load balancing scheme at run time These strategies lie on the extremes points on the two axes global vs local and central vs distributed There are more schemes that lie within the extreme points and they will be explored in future works Distribute d Centralize Local Global Global Strategy In the global scheme all load balancing decisions are made on a global scale That is all processors get an interrupt once one processor has completed its work Then all the processors send their performance profile to the load balancer In the Centralized scheme the load balancer is located in a centralized area often on the master processor After calculating the new distribution the load balancer sends instructions to all the processors who have to send work to the others The receiving processors just wait until they have received all the data necessary to proceed In a Distributed scheme the load balancer is replicated on all the processors The profile information is broadcast to all the processors and the receiving processors wait for the work while the sending processors ship the data This method eliminates the need for instructions from the master processor Local Strategy In the local scheme the processors are divided into groups of size K The load balancing decisions are only done within a group K block static partitioning other approaches such as K nearest neighbor and dynamic partitioning may also be used Profile information is only exchanged with a group The Centralized scheme still only has one load balancer instead of one per group The load balancer besides learning the profile performance of each processor must also keep track of which group each processor is in As for the Distributed scheme each processor still maintains its own load balancer and profiles are sent amongst all the processors in the group Tradeoffs Global vs Local For the Global scheme since global information is available at synchronization time the work distribution is nearly optimal and convergence is faster than Local scheme However communication and synchronization cost is much higher in the Global scheme For the Local scheme even though initially the groups may be partitioned statically all groups are assumed to be divided up evenly by performance and speed performance of each processor may change over time Thus one group of processors with poor performance will be overloaded while another group of processors may be idle
View Full Document
Unlocking...