DOC PREVIEW
CMU CS 15740 - Lecture

This preview shows page 1-2-17-18-19-35-36 out of 36 pages.

Save
View full document
Premium Document
Do you want full access? Go Premium and unlock all 36 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Parallel Programming Todd C Mowry CS 740 Topics October 16 18 Motivating Examples Parallel Programming 2000for High Performance Impact of the Programming Model Case Studies Ocean simulation Barnes Hut N body simulation Motivating Problems Simulating Ocean Currents Regular structure scientific computing Simulating the Evolution of Galaxies Irregular structure scientific computing Rendering Scenes by Ray Tracing Irregular structure computer graphics Not discussed here read in book 2 CS 740 F 00 Simulating Ocean Currents a Cross sections b Spatial discretization of a cross section Model as two dimensional grids Discretize in space and time finer spatial and temporal resolution greater accuracy Many different computations per time step set up and solve equations Concurrency across and within grid computations 3 CS 740 F 00 Simulating Galaxy Evolution Simulate the interactions of many stars evolving over time Computing forces is expensive m1m2 O n2 brute force approach Hierarchical Methods take advantage of force law r2G Star on which for ces are being computed Star too close to approximate Many 4 Large group far enough away to approximate Small gr oup far enough away to approximate to center of mass time steps plenty of concurrency across stars within one CS 740 F 00 Rendering Scenes by Ray Tracing Shoot rays into scene through pixels in image plane Follow their paths they bounce around as they strike objects they generate new rays ray tree per input ray Result is color and opacity for that pixel Parallelism across rays All case studies have abundant concurrency 5 CS 740 F 00 Parallel Programming Task Break up computation into tasks assign tasks to processors Break up data into chunks assign chunks to memories Introduce synchronization for mutual exclusion event ordering 6 CS 740 F 00 Steps in Creating a Parallel Program Partitioning D e c o m p o s i t i o n Sequential computation A s s i g n m e n t Tasks p0 p1 p2 p3 Processes O r c h e s t r a t i o n p0 p1 p2 p3 Parallel pr ogram M a p p i n g P0 P1 P2 P3 Processors 4 steps Decomposition Assignment Orchestration Mapping Done by programmer or system software compiler runtime Issues are the same so assume programmer does it all explicitly 7 CS 740 F 00 Partitioning for Performance Balancing the workload and reducing wait time at synch points Reducing inherent communication Reducing extra work Even these algorithmic issues trade off Minimize comm run on 1 processor extreme load imbalance Maximize load balance random assignment of tiny tasks no control over communication Good partition may imply extra work to compute or manage it Goal is to compromise Fortunately often not difficult in practice 8 CS 740 F 00 Load Balance and Synch Wait Time Sequential Work Limit on speedup Speedupproblem p Max Work on any Processor Work includes data access and other costs Not just equal work but must be busy at same time Four parts to load balance and reducing synch wait time 1 Identify enough concurrency 2 Decide how to manage it 3 Determine the granularity at which to exploit it 4 Reduce serialization and cost of synchronization 9 CS 740 F 00 Deciding How to Manage Concurrency Static versus Dynamic techniques Static Algorithmic assignment based on input won t change Low runtime overhead Computation must be predictable Preferable when applicable except in multiprogrammed heterogeneous environment Dynamic Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads 10 CS 740 F 00 Dynamic Assignment Profile based semi static Profile work distribution at runtime and repartition dynamically Applicable in many computations e g Barnes Hut some graphics Dynamic Tasking Deal with unpredictability in program or environment e g Raytrace computation communication and memory system interactions multiprogramming and heterogeneity used by runtime systems and OS too Pool of tasks take and add tasks until done E g self scheduling of loop iterations shared loop counter 11 CS 740 F 00 Dynamic Tasking with Task Queues Centralized versus distributed queues Task stealing with distributed queues Can compromise comm and locality and increase synchronization Whom to steal from how many tasks to steal Termination detection Maximum All processes imbalance related to size of task insert tasks P0 inserts QQ 0 P1 inserts P2 inserts P3 inserts Q1 Q2 Q3 Others may steal All remove tasks a Centralized task queue 12 P0 removes P1 removes P2 removes P3 removes b Distributed task queues one per pr ocess CS 740 F 00 Determining Task Granularity Task granularity amount of work associated with a task General rule Coarse grained often less load balance Fine grained more overhead often more communication and contention Communication and contention actually affected by assignment not size Overhead by size itself too particularly with task queues 13 CS 740 F 00 Reducing Serialization Careful about assignment and orchestration including scheduling Event synchronization Reduce use of conservative synchronization e g point to point instead of barriers or granularity of pt to pt But fine grained synch more difficult to program more synch ops Mutual exclusion Separate locks for separate data e g locking records in a database lock per process record or field lock per task in task queue not per queue finer grain less contention serialization more space less reuse Smaller less frequent critical sections don t do reading testing in critical section only modification e g searching for task to dequeue in task queue building tree Stagger critical sections in time 14 CS 740 F 00 Reducing Inherent Communication Communication is expensive Measure communication to computation ratio Focus here on inherent communication Determined by assignment of tasks to processes Later see that actual communication can be greater Assign tasks that access same data to same process Solving communication and load balance NPhard in general case But simple heuristic solutions work well in practice Applications have structure 15 CS 740 F 00 Domain Decomposition Works well for scientific engineering graphics applications Exploits local biased nature of physical problems Information requirements often short range Or long range but fall off with distance Simple example nearest neighbor grid computation n n p P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 n n p Perimeter to Area comm to comp ratio area to volume in 3D Depends on n p decreases with n increases with


View Full Document

CMU CS 15740 - Lecture

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?