CS 267: Applications of Parallel Computers
Load Balancing
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267

Load Imbalance in Parallel Applications
The primary sources of inefficiency in parallel codes:
• Poor single-processor performance
  • Typically in the memory system
• Too much parallelism overhead
  • Thread creation, synchronization, communication
• Load imbalance
  • Different amounts of work across processors
    • Computation and communication
  • Different speeds (or available resources) for the processors
    • Possibly due to load on the machine
• Recognizing load imbalance
  • Time spent at synchronization is high and uneven across processors; don't rely on just the min/max/median/mean of barrier times

Measuring Load Imbalance
• Challenges:
  • Can be hard to separate from high synchronization overhead
    • Especially subtle if not bulk-synchronous
  • "Spin locks" can make synchronization look like useful work
  • Note that imbalance may change over phases
• Insufficient parallelism always leads to load imbalance
• Tools like TAU can help [the slide shows a TAU screenshot]

Load Balancing Overview
Load balancing differs with properties of the tasks (chunks of work):
• Task costs
  • Do all tasks have equal costs?
  • If not, when are the costs known?
    • Before starting, when the task is created, or only when the task ends?
• Task dependencies
  • Can all tasks be run in any order (including in parallel)?
  • If not, when are the dependencies known?
    • Before starting, when the task is created, or only when the task ends?
• Locality
  • Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?
  • When is the information about communication known?

Task Cost Spectrum
[figure-only slide]

Task Dependency Spectrum
[figure-only slide]

Task Locality Spectrum (Communication)
[figure-only slide]

Spectrum of Solutions
One of the key questions is when certain information about the load balancing problem is known.
This leads to a spectrum of solutions:
• Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts (offline algorithms).
• Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.
• Dynamic scheduling. Information is not known until mid-execution (online algorithms).
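The lecture does not mention OpenMP, but OpenMP's schedule clause is a convenient concrete instance of this spectrum. A minimal sketch; the work function is a made-up stand-in for a loop body whose cost varies by iteration:

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical loop body with iteration-dependent (unpredictable) cost. */
static double result[1024];
static void work(int i) {
    double s = 0.0;
    for (int j = 0; j < (i % 97) * 1000; j++)
        s += (double)j;
    result[i % 1024] = s;
}

int main(void) {
    int n = 100000;

    /* Static: iterations divided among threads before the loop runs
       (an offline schedule); lowest overhead, no adaptation. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) work(i);

    /* Dynamic: threads grab chunks of 4 iterations from a shared queue
       at run time (an online schedule); adapts, but costs more overhead. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) work(i);

    /* Guided: chunks start large and shrink toward the end, the guided
       self-scheduling idea covered later in the lecture. */
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; i++) work(i);

    printf("done\n");
    return 0;
}
```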
Dynamic Load Balancing
• Motivation for dynamic load balancing
  • Search algorithms
• Centralized load balancing
  • Overview
  • Special case: scheduling independent loop iterations
• Distributed load balancing
  • Overview
  • Engineering
  • Theoretical results
• Load balancing in general
  • Example scheduling problem: mixed parallelism
  • Demonstrates use of coarse performance models

Search
• Search problems often:
  • Are computationally expensive
  • Have very different parallelization strategies than physical simulations
  • Require dynamic load balancing
• Examples:
  • Optimal layout of VLSI chips
  • Robot motion planning
  • Chess and other games (N-queens)
  • Speech processing
  • Eigenvalue search
  • Constructing a phylogeny tree from a set of genes

Example Problem: Tree Search
• In tree search the tree unfolds dynamically
• May be a graph if there are common sub-problems along different paths
  • Unlike meshes, which are precomputed and have no ordering constraints
[figure: example search tree; legend distinguishes non-terminal nodes, terminal goal nodes, and terminal non-goal nodes]

Sequential Search Algorithms
• Depth-first search (DFS)
  • Simple backtracking
    • Search to the bottom, backing up to the last choice if necessary
  • Depth-first branch-and-bound
    • Keep track of the best solution so far (the "bound")
    • Cut off sub-trees that are guaranteed to be worse than the bound
  • Iterative deepening (a minimal sketch appears after the Variations on Self-Scheduling slide below)
    • Choose a bound d on the search depth and use DFS up to depth d
    • If no solution is found, increase d and start again
    • Iterative deepening A* uses a lower-bound estimate of cost-to-solution as the bound
• Breadth-first search (BFS)
  • Search across a given level in the tree

Parallel Search
• Consider simple backtracking search
• Use static load balancing: spawn each new task on an idle processor, until all have a subtree
[figures: load balance on 2 processors and on 4 processors]

Centralized Scheduling
• Keep a queue of tasks waiting to be done
  • May be managed by a manager task
  • Or by a shared data structure protected by locks
[figure: a central task queue feeding multiple workers]

Centralized Task Queue: Scheduling Loops
• When applied to loops, often called self-scheduling:
  • Tasks may be ranges of loop indices to compute
  • Assumes independent iterations
  • Loop body has unpredictable time (branches); otherwise the problem is not interesting
• Originally designed for:
  • Scheduling loops by the compiler (really the runtime system)
  • Original paper by Tang and Yew, ICPP 1986
• This is:
  • A dynamic, online scheduling algorithm
  • Good for a small number of processors (centralized)
  • A special case of a task graph: independent tasks, all known at once

Variations on Self-Scheduling
• Typically, don't want to grab the smallest unit of parallel work, e.g., a single iteration
• Instead, grab a chunk of tasks of size K
  • If K is large, access overhead for the task queue is small
  • If K is small, we are likely to have even finish times (load balance)
• Four variations:
  1. Use a fixed chunk size
  2. Guided self-scheduling
  3. Tapering
  4. Weighted factoring
• Note: there are other variations
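Picking up the iterative-deepening idea from the Sequential Search Algorithms slide, here is a minimal sketch. The implicit binary tree (node n has children 2n+1 and 2n+2) and the goal test are toy stand-ins, not anything from the lecture:

```c
#include <stdio.h>

/* Depth-limited DFS: returns 1 if the goal lies within `depth` levels. */
static int dls(long node, long goal, int depth) {
    if (node == goal) return 1;
    if (depth == 0) return 0;          /* cut off: back up to last choice */
    return dls(2 * node + 1, goal, depth - 1) ||
           dls(2 * node + 2, goal, depth - 1);
}

/* Iterative deepening: run DFS with bound d, increasing d until found. */
static int iddfs(long root, long goal, int max_depth) {
    for (int d = 0; d <= max_depth; d++)
        if (dls(root, goal, d)) return d;  /* depth at which goal appears */
    return -1;                             /* not found within max_depth */
}

int main(void) {
    printf("goal found at depth %d\n", iddfs(0, 12, 10)); /* prints 3 */
    return 0;
}
```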
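A minimal hand-rolled sketch of chunked self-scheduling, assuming a shared counter as the centralized task queue and a made-up iteration body; it uses the fixed chunk size of variation 1 below:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000
#define K 16   /* fixed chunk size: the tradeoff discussed above */

/* Hypothetical iteration body with unpredictable cost. */
static double out[N];
static void iteration(int i) {
    double s = 0.0;
    for (int j = 0; j < (i % 97) * 100; j++) s += (double)j;
    out[i] = s;
}

int main(void) {
    int next = 0;   /* the "task queue": next unclaimed loop index */
    #pragma omp parallel
    {
        for (;;) {
            int start;
            /* Atomically grab the next chunk of K iterations. */
            #pragma omp atomic capture
            { start = next; next += K; }
            if (start >= N) break;              /* queue is empty */
            int end = (start + K < N) ? start + K : N;
            for (int i = start; i < end; i++)
                iteration(i);
        }
    }
    printf("out[N-1] = %g\n", out[N - 1]);
    return 0;
}
```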
Variation 1: Fixed Chunk Size
• Kruskal and Weiss give a technique for computing the optimal chunk size
• Requires a lot of information about the problem characteristics
  • E.g., task costs as well as their number
• Not very useful in practice
  • Task costs must be known at loop startup time
  • E.g., in a compiler, all branches must be predicted based on loop indices and used for task cost estimates

Variation 2: Guided Self-Scheduling
• Idea: use larger chunks at the beginning to avoid excessive overhead, and smaller chunks near the end to even out the finish times
• The chunk size K_i at the i-th access to the task queue is K_i = ⌈R_i / p⌉, where R_i is the number of iterations remaining and p is the number of processors
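The preview truncates this slide mid-sentence; the chunk-size rule above is the standard one from guided self-scheduling (Polychronopoulos and Kuck, 1987). A small sketch that just prints the shrinking chunk sizes under that rule, with made-up values for the iteration count and processor count:

```c
#include <stdio.h>

int main(void) {
    int remaining = 1000;   /* R_1: total iterations at the start */
    int p = 4;              /* number of processors */
    for (int i = 1; remaining > 0; i++) {
        int k = (remaining + p - 1) / p;   /* K_i = ceil(R_i / p) */
        printf("access %d: chunk of %d (%d left)\n",
               i, k, remaining - k);
        remaining -= k;
    }
    return 0;
}
```

The first accesses hand out large chunks (250, 188, 141, ...) while the queue is full, and the chunks shrink toward single iterations near the end, which is exactly the early-overhead versus late-imbalance tradeoff the slide describes.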

