Thread Scheduling for Chip-Multiprocessors: Final Report

Ravishankar Krishnaswamy ([email protected])
Harshavardhan Simhadri ([email protected])

December 3, 2009

Abstract

Most parallel programs exhibit more parallelism than is available in processors produced today. While a user may optimize a program to incur few cache misses and run fast, when such a program is cut into threads and run in parallel, its run time is heavily dependent on the scheduler used to coordinate those threads. We investigate the case of scheduling for systems with cache hierarchies. We contrast schedulers such as the work-stealing scheduler used in programming tools like Cilk++ with a scheduler we propose. We argue that work stealing may not be optimal under all workloads on such systems.

1 Introduction

Because of the speed limitations that hardware imposes on a single-core design, much effort has gone into making processors faster by building many smaller cores and using parallelism in the application to create multiple threads that can use these cores. Since multiple cores fit on a chip only at the expense of the complex logic used to accelerate prefetching and prediction, it is important for the application to work well within these limited capabilities.

In most parallel applications, it is easy to find which parts can be parallelised. However, it is difficult to figure out how to arrange these parallel computations over the many available cores in such a way that the execution incurs low memory latency, given the distributed nature of the caching system. Even when the algorithm has enormous parallelism, a good thread scheduler is crucial to getting a good run time.

1.1 The problem

The problem of scheduling for completely shared [BG04] or completely distributed caches [BL99] is well understood. In fact, the latter work has been successfully implemented as the Cilk++ [BJKLRZ96] runtime system. Cilk++ (like OpenMP) enables the programmer to write C programs with simple directives that indicate parallel tasks and to get an executable with a scheduler suitable for those directives built in. Similar research has been done on implementing schedulers for shared-cache machines [N99]. However, there is no consensus on schedulers for hierarchical caches that are neither completely distributed nor completely shared. The work on provably good schedulers for divide-and-conquer algorithms [BCGRCK] provides a suitable starting point: our scheduler is a multi-level generalization of a slightly more aggressive variant of the scheduler proposed in that work.

We intend to build a system similar to Cilk++ or OpenMP for such machines, one that allows the user to specify points of parallelism in a nested parallel program and uses a smart scheduler to coordinate these parallel pieces of computation. We explore a scheduling algorithm suitable for reasonably parallel programs on systems with such cache hierarchies. Further, we assume that when a new thread is spawned, the programmer is able to provide us with a close upper bound on the number of distinct memory accesses of each thread (including all nested computation spawned by that thread).
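To make the intended interface concrete, the sketch below shows what such an annotated nested parallel program might look like. This is a hypothetical illustration, not the system's actual API: the names spawn and sync, the space-hint parameter, and the serial stand-in implementations are all our own assumptions.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Hypothetical interface sketch. The second argument to spawn is the
    // programmer-supplied upper bound on the number of distinct bytes the
    // task (and everything it spawns) will touch; the scheduler would use
    // this hint to decide where in the cache hierarchy to enqueue the task.
    // Here spawn simply runs the task serially so the sketch is
    // self-contained; a real runtime would execute it asynchronously.
    static void spawn(const std::function<void()>& task, std::size_t space_bound_bytes) {
        (void)space_bound_bytes;  // a real scheduler would consult this hint
        task();                   // serial stand-in for asynchronous execution
    }
    static void sync() {}         // serial stand-in: nothing to wait for

    // Nested-parallel sum: each half touches about half the range, which
    // gives a close upper bound for the space hint.
    static long sum(const int* a, std::size_t n) {
        if (n <= 1024) {          // small base case stays sequential
            long s = 0;
            for (std::size_t i = 0; i < n; ++i) s += a[i];
            return s;
        }
        std::size_t half = n / 2;
        long left = 0, right = 0;
        spawn([&] { left = sum(a, half); }, half * sizeof(int));
        spawn([&] { right = sum(a + half, n - half); }, (n - half) * sizeof(int));
        sync();                   // both halves are complete past this point
        return left + right;
    }

    int main() {
        std::vector<int> v(100000, 1);
        return sum(v.data(), v.size()) == 100000 ? 0 : 1;
    }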
While we limit our study to cache hierarchies that look like fat trees, insights learned from such a setting could also be used for cache systems with grid interconnects. For instance, one may view a 9 × 9 grid as a collection of 9 sets of 3 × 3 nodes; such sets may be viewed as nodes in a tree with memory as the root and processors at the leaves.

2 Our approach

[BL99] describes the work-stealing scheduler for distributed caches. In a work-stealing scheduler, a queue is assigned to each processor. When the thread a processor is running splits into a set of parallel threads, the processor continues to work on one of these threads and pushes the rest onto the end of its job queue. When a processor runs out of work, it takes a job from the front of its queue; if its queue is empty, it steals a job from the front of another processor's queue. This is a greedy scheduler which can be shown to be reasonably cache-friendly for distributed caches, and it also works well in practice on such machines. (A minimal sketch of this discipline, and of our proposed alternative, appears at the end of this section.)

However, in a cache hierarchy, a processor is not aware that not all steals have the same cost. Steals from processors separated by several upper-level caches are costly and should be avoided when jobs are available closer in the memory hierarchy. We propose the following non-greedy scheduler (the HR scheduler), which makes better use of locality in the computation:

• Associate a job queue and a bit (the is_blocked bit) with each cache.

• Assign the root thread to any processor. When a processor spawns new threads, check the space requirement of the newly spawned threads. If the space falls between that of a level-i cache (L_i) and that of an L_{i+1} cache, push the newly spawned jobs onto the queue of the L_{i+1} cache under which the current processor falls, and set the is_blocked bit of that cache to one.

• If a processor runs out of jobs, it starts searching for a new job from the cache closest to it. If it finds a job in its L1 cache's queue, it works on that job. If this queue is empty and the L1's is_blocked bit is 0, it moves to the cache one level up. It continues this process until it either finds a job or comes across a cache whose is_blocked bit is set to 1. In the latter case, it waits until this bit is reset to 0 or some job shows up in one of the lower-level caches between the processor and the blocked cache.

Further motivation for choosing this scheduler is as follows:

• The cache complexity of a computation under this scheduler is almost the same as that under a sequential schedule.

• It minimizes cache invalidations over large distances.

This scheduler is non-greedy: it might wait on a blocked cache while work is available elsewhere. This is not a problem for algorithms in which the sequential parts of the computation need no more space than is available in an L1 cache. This is usually the case: most parallel algorithms can be written so that the individual sequential threads are small.

Our implementation approach

Implementing thread scheduling at the OS level would be the most time-efficient approach. However, that approach is inflexible in that it does not allow users to choose their own scheduling. Further, building a secondary scheduler on top of the OS thread scheduler works against the purpose. Therefore, we choose to implement our scheduler at the user level on
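The first sketch below illustrates the work-stealing discipline described at the start of Section 2: one queue per processor, spawned jobs pushed at the end, and jobs taken from the front, whether by the owner or by a stealing processor. It is a minimal, mutex-based illustration under our own naming (Job, JobQueue, worker_loop); production runtimes such as Cilk++ use carefully engineered lock-free deques instead.

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <vector>

    using Job = std::function<void()>;

    // Per-processor job queue, as described in Section 2: spawned jobs go
    // on at the end; the owner (and any thief) takes from the front.
    class JobQueue {
    public:
        void push(Job j) {                  // owner: enqueue a spawned job
            std::lock_guard<std::mutex> g(m_);
            jobs_.push_back(std::move(j));
        }
        std::optional<Job> take() {         // owner pop / thief steal: front
            std::lock_guard<std::mutex> g(m_);
            if (jobs_.empty()) return std::nullopt;
            Job j = std::move(jobs_.front());
            jobs_.pop_front();
            return j;
        }
    private:
        std::mutex m_;
        std::deque<Job> jobs_;
    };

    // One processor's scheduling loop: work from its own queue, and steal
    // from another processor's queue when its own is empty. Victim
    // selection, termination detection, and backoff are elided.
    void worker_loop(std::vector<JobQueue>& queues, std::size_t self) {
        for (;;) {
            std::optional<Job> j = queues[self].take();
            for (std::size_t k = 1; !j && k < queues.size(); ++k)
                j = queues[(self + k) % queues.size()].take();  // steal attempt
            if (j) (*j)();
            // else: all queues empty; a real runtime would back off or exit
        }
    }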
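The second sketch illustrates the HR scheduler's two rules from the bullet list in Section 2: placement (enqueue a spawned job at the lowest cache level whose capacity covers the job's space bound, and set that cache's is_blocked bit) and the non-greedy search (walk upward from the processor's L1, stopping to wait at a blocked cache). The tree representation, all names, and the single-threaded treatment of the queues and bits are our own simplifying assumptions; in particular, how and when the is_blocked bit is reset is not shown.

    #include <cstddef>
    #include <deque>
    #include <functional>

    // Each cache in the fat-tree hierarchy carries a job queue and an
    // is_blocked bit, as in the bullets above. Synchronization is elided:
    // this sketch only shows the placement and search logic.
    struct Job {
        std::function<void()> run;
        std::size_t space_bound;   // programmer-supplied bound, in bytes
    };

    struct Cache {
        std::size_t size;          // capacity in bytes; grows toward the root
        bool is_blocked = false;
        std::deque<Job> queue;
        Cache* parent = nullptr;   // toward memory, the root of the fat tree
    };

    // Placement: walk up from the spawning processor's L1 to the first
    // cache level large enough to hold the job's footprint; enqueue the
    // job there and set that cache's is_blocked bit.
    void place(Cache* l1, Job job) {
        Cache* c = l1;
        while (c->parent != nullptr && job.space_bound > c->size)
            c = c->parent;
        c->queue.push_back(std::move(job));
        c->is_blocked = true;      // processors below will wait here
    }

    // Search (non-greedy): starting at the processor's L1, take a job if
    // one is queued; otherwise move up only while the current level's
    // is_blocked bit is 0. Returning nullptr means the walk stopped at a
    // blocked cache (or ran out of levels): the processor waits for the
    // bit to clear or for work to appear below, rather than stealing from
    // farther away in the hierarchy.
    Job* find_job(Cache* l1) {
        for (Cache* c = l1; c != nullptr; c = c->parent) {
            if (!c->queue.empty())
                return &c->queue.front();  // caller removes and runs the job
            if (c->is_blocked)
                return nullptr;            // wait at the blocked cache
        }
        return nullptr;                    // no work on the path to the root
    }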

