Thread Scheduling for Chip Multiprocessors Final Report Ravishankar Krishnaswamy ravishan cs cmu edu Harshavardhan Simhadri harshas cs cmu edu December 3 2009 Abstract Most parallel programs exhibit more parallelism than is available in processors produced today While a user may optimize a program to incur few cache misses and run fast when such a program is cut in to threads and run in parallel it s run time is heavily dependent on the scheduler used to coordinate these threads We investigate the case of sceduling for systems with cache hierarchies We contrast schedulers such as the work stealing scheduler used in programming tools such like Cilk with a scheduler we propose We argue that work stealing may not be optimal under all workloads for such systems 1 Introduction Because of the speed limitations imposed by hardware on a single core threaded design much effort has gone into making processors faster by creating many smaller cores and using parallelism in the application to create multiple threads that can use this multiple cores Since multiple cores are printed on a chip at the expense of complex logic used to accelerate prefetching and prediction logic it is important for the application to be able to work well with this limited capabilities In most parallel applications it is easy to find which parts can be parallelised However it is difficult to figure out how to arrange this parallel computations over the many available cores in such a way that the execution incurs low memory latency due to the distributed nature of the caching system Even when the algorithm has enormous parallelism a good thread scheduler is crucial to get good a run time 1 1 The problem The problem of schedulers for completely shared BG04 or completely distributed caches BL99 is well understood In fact the latter work has been successfully implemented as the Cilk BJKLRZ96 run time system Cilk like OpenMP enables the programmer to write C programs with simple directives to indicate parallel tasks and get an executable with the scheduler suitable for the given directives built 1 in Similar research on implementing schedulers for shared cache machines N99 has been done However there is no consensus on schedulers for hierarchical caches that are neither completely distributed nor shared The work on provably good schedulers for divide and conquer algorithms BCGRCK provides a suitable starting point for this work Our scheduler is a multi level generalization of a slightly more aggressive variant of scheduler proposed in this work We intend to build a system similar to Cilk or OpenMP for such systems which allows the user to specify points of parallelism in a nested parallel program and uses a smart scheduler to coordinate this parallel pieces of computation We explore a scheduling algorithm suitable for reasonably parallel programs for systems with such cache hierarchies Further we assume that when a new thread is spawned the programmer is able to provide us with a close upper bound on the number of distinct memory accesses of each threads including all nested computation spawned by this thread While we limit our study to cache hierarchies that look like fat trees insights learned from such a setting could also be used for cache systems with grid interconnects For instance one may view a 9 9 grid as a collection of 9 sets of 3 3 nodes Such sets may be views as nodes in tree with memory as the root and processors at the leaves 2 Our approach BL99 describes the work stealing scheduler for distributed caches In a work stealing scheduler a queue is assigned to each processor When the thread a processor is running splits in to a set of parallel threads the processor continues to work on one of these threads and pushes the rest at the end of it s job queue When a processor runs out of work it takes a job from the front of it s queue In case there are no jobs in it s queue it steals a job from the front of another processor s queue This is a greedy scheduler which can be shown to be reasonably cache friendly for distributed caches It also works well in practice for such machines However in a cache hierarchy a processor is not aware of the fact that not all steals are of the same cost Steals from processors separated by several upper level caches are costly and should be avoided when jobs are available closer in the memory hierarchy We propose the following non greedy scheduler HR scheduler that makes better use of locality in computation Associate a job queue and a bit called is blocked bit with each cache Assign the root thread to any processor If a processor spawns a new thread check the space requirement of the newly spawned threads If the space falls between that of a level Li cache and that of Li 1 push the newly spawned jobs in to the queue of the Li 1 cache under which the current processor falls Also set the is blocked bit of this cache to one If a processor runs out of jobs it starts searching for a new job starting from the cache closest to it If it finds a job in it s L1 cache it works on it If this queue is empty and the L1 s is blocked bit is 0 it moves to the cache one level up It continues this process until it finds a job or it comes across a cache whose is blocked is set to 1 In the latter case it waits till this bit is reset to 0 or some job shows up on one of the lower level caches between the processor and the blocked cache 2 Further motivation for choosing this scheduler is as follows The cache complexity of a computation under this scheduler is almost the same as that under a sequential schedule It minimizes cache invalidations over large distances This scheduler is non greedy The scheduler might wait on a blocked cache when work is available elsewhere This is not a problem for algorithms where the sequential parts of a computation do not need more than the space available in an L1 cache This is usually the case with most parallel algorithms they can be written in a way such that the individual sequential threads are small Our implementation approach Implementing thread scheduling at the OS level would be the most time efficient approach However this approach is inflexible in that it does not allow users to choose their scheduling Further building a secondary scheduler on top of OS thread scheduler is useless and works against the purpose Therefore we choose to implement our scheduler at the user level on top of the OS level basic thread scheduler which simply leaves the threads in their
View Full Document
Unlocking...