Introduction Parallel Programming Performance Rich space of techniques and issues Trade off and interact with one another Issues can be addressed helped by software or hardware Algorithmic or programming techniques Architectural techniques Todd C Mowry CS 740 September 25 2007 Focus here on performance issues and software techniques Point out some architectural implications Architectural techniques covered in rest of class 2 Programming as Successive Refinement CS 740 F 07 Outline Not all issues dealt with up front Partitioning often independent of architecture and done first Partitioning for performance Relationship of communication data locality and architecture View machine as a collection of communicating processors balancing the workload reducing the amount of inherent communication reducing extra work Tug o war even among these three issues Programming for performance For each issue Then interactions with architecture View machine as extended memory hierarchy extra communication due to architectural interactions cost of communication depends on how it is structured May inspire changes in partitioning Techniques to address it and tradeoffs with previous issues Illustration using case studies Application to grid solver Some architectural implications Components of execution time as seen by processor What workload looks like to architecture and relate to software issues Discussion of issues is one at a time but identifies tradeoffs Use examples and measurements on SGI Origin2000 3 CS 740 F 07 4 Page 1 CS 740 F 07 Partitioning for Performance Load Balance and Synch Wait Time Balancing the workload and reducing wait time at synch points Reducing inherent communication Reducing extra work Limit on speedup Speedupproblem p Work includes data access and other costs Not just equal work but must be busy at same time Four parts to load balance and reducing synch wait time Even these algorithmic issues trade off 1 Identify enough concurrency Minimize comm run on 1 processor extreme load imbalance Maximize load balance random assignment of tiny tasks no control over communication Good partition may imply extra work to compute or manage it 2 Decide how to manage it 3 Determine the granularity at which to exploit it Goal is to compromise 4 Reduce serialization and cost of synchronization Fortunately often not difficult in practice 5 CS 740 F 07 6 Identifying Concurrency Function parallelism Loop structure fundamental dependences new algorithms Data Parallelism versus Function Parallelism Often see orthogonal levels of parallelism e g VLSI routing W2 W3 a Wire W2 expands to segments S21 S22 S23 S24 S25 CS 740 F 07 Identifying Concurrency contd Techniques seen for equation solver W1 Sequential Work Max Work on any Processor S26 entire large tasks procedures that can be done in parallel on same or different data e g different independent grid computations in Ocean pipelining as in video encoding decoding or polygon rendering degree usually modest and does not grow with input size difficult to load balance often used to reduce synch between data parallel phases Most scalable programs data parallel per this loose definition b Segment S23 expands to routes function parallelism reduces synch between data parallel phases c 7 CS 740 F 07 8 Page 2 CS 740 F 07 Deciding How to Manage Concurrency Dynamic Assignment Static versus Dynamic techniques Profile based semi static Profile work distribution at runtime and repartition dynamically Applicable in many computations e g Barnes Hut some graphics Static Algorithmic assignment based on input won t change Low runtime overhead Computation must be predictable Preferable when applicable except in multiprogrammed or heterogeneous environment Dynamic Tasking Deal with unpredictability in program or environment e g Raytrace computation communication and memory system interactions multiprogramming and heterogeneity used by runtime systems and OS too Pool of tasks take and add tasks until done E g self scheduling of loop iterations shared loop counter Dynamic Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads 9 CS 740 F 07 10 Dynamic Tasking with Task Queues Impact of Dynamic Assignment Centralized versus distributed queues Task stealing with distributed queues On SGI Origin 2000 cache coherent shared memory All processes insert tasks P0 inserts QQ P1 inserts Q1 0 P2 inserts Q2 z 30 Can compromise comm and locality and increase synchronization Whom to steal from how many tasks to steal Termination detection Maximum imbalance related to size of task 20 P3 inserts a Centralized task queue 11 P2 removes z 20 z 15 a 0 P1 removes Origin dynamic Challenge dynamic Origin static Challenge static 25 10 Q3 Others may steal P0 removes z 30 z z z z 15 z z z 5 z 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 P3 removes 10 z 5 All remove tasks Origin semistatic Challenge semistatic Origin static Challenge static 25 Speedup CS 740 F 07 Speedup Number of processors b 0 z z z 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Number of processors b Distributed task queues one per pr ocess CS 740 F 07 12 Page 3 CS 740 F 07 Determining Task Granularity Reducing Serialization Careful about assignment and orchestration including scheduling Event synchronization Task granularity amount of work associated with a task General rule Reduce use of conservative synchronization e g point to point instead of barriers or granularity of pt to pt But fine grained synch more difficult to program more synch ops Coarse grained often less load balance Fine grained more overhead often more communication contention Mutual exclusion Separate locks for separate data e g locking records in a database lock per process record or field lock per task in task queue not per queue finer grain less contention serialization more space less reuse Smaller less frequent critical sections don t do reading testing in critical section only modification e g searching for task to dequeue in task queue building tree Stagger critical sections in time Communication contention actually affected by assignment not size Overhead by size itself too particularly with task queues 13 CS 740 F 07 14 Reducing Inherent Communication CS 740 F 07 Domain Decomposition Communication is expensive Measure communication to computation ratio Focus here on inherent communication Works well for scientific engineering graphics applications Exploits local biased nature of physical
View Full Document
Unlocking...