Parallel System Performance Evaluation Scalability Factors affecting parallel system performance Algorithm related parallel program related architecture hardware related Workload Driven Quantitative Architectural Evaluation Select applications or suite of benchmarks to evaluate architecture either on real or simulated machine From measured performance results compute performance metrics Speedup System Efficiency Redundancy Utilization Quality of Parallelism Resource oriented Workload scaling models How the speedup of an application is affected subject to specific constraints Problem constrained PC Fixed load Model Time constrained TC Fixed time Model 2 Memory constrained MC Fixed Memory Model 1 3 Performance Scalability Definition Conditions of scalability Factors affecting scalability Parallel Computer Architecture Chapter 4 Parallel Programming Chapter 1 handout For a given parallel system and parallel problem algorithm Informally The ability of parallel system performance to increase with increased problem and system size EECC756 Shaaban 1 lec 9 Spring2008 4 29 2008 Parallel Program Performance Parallel processing goal is to maximize speedup Speedup Sequential Work Time 1 Time p Max Work Synch Wait Time Comm Cost Extra Work By Balancing computations overheads on processors every processor hasParallelizing the same amount Overheads Max for workload any processor of work overheads Minimizing communication cost and other overheads associated with each step of parallel program 1 creation and execution 2 Parallel Performance Scalability For a given parallel system and parallel computation problem algorithm Achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size number of processors are increased Or Continue to achieve good parallel performance speedup as the sizes of the system problem are increased More formal treatment of scalability later EECC756 Shaaban 2 lec 9 Spring2008 4 29 2008 Factors affecting Parallel System Performance Parallel Algorithm related i e Inherent Parallelism Available concurrency and profile dependency graph uniformity patterns Complexity and predictability of computational requirements Required communication synchronization uniformity and patterns Data size requirements Parallel program related Partitioning Decomposition and assignment to tasks Parallel task grain size Communication to computation ratio Programming model used Orchestration C to C ratio measure of inherent communication Cost of communication synchronization Resulting data code memory requirements locality and working set characteristics Mapping Scheduling Dynamic or static Hardware Architecture related Total CPU computational power available Parallel programming model support e g support for Shared address space Vs message passing support Architectural interactions artifactual extra communication Communication network characteristics Scalability topology Memory hierarchy properties Refined from factors in Lecture 1 EECC756 Shaaban 3 lec 9 Spring2008 4 29 2008 Parallel Performance Metrics Revisited Degree of Parallelism DOP For a given time period reflects the number of processors in a specific parallel computer actually executing a particular parallel program i e concurrency profile Average Parallelism A Given maximum parallelism m n homogeneous processors Computations sec Computing capacity of a single processor Total amount of work instructions or computations m t2 W DOP t dt or as a discrete summation W i t i 1 t1 i m Where ti is the total time that DOP i and The average parallelism A t2 1 A DOP t dt t 2 t1 t1 Execution Time From Lecture 3 In discrete form DOP Area t t t i 1 i m A i t i i 1 2 1 Execution Time m t i i 1 EECC756 Shaaban 4 lec 9 Spring2008 4 29 2008 Example Concurrency Profile of A Divide and Conquer Algorithm m m Execution observed from t1 2 to t2 27 A i i i Peak parallelism m 8 i 1 i 1 A 1x5 2x3 3x4 4x6 5x2 6x2 8x3 5 3 4 6 2 2 3 93 25 3 72 t Average Parallelism t Degree of Parallelism DOP 11 10 Concurrency Profile 9 8 7 6 5 4 3 2 Area equal to total of computations or work W 1 1 2 t1 3 4 5 6 From Lecture 3 7 8 9 10 11 12 13 14 Time 15 16 17 18 19 20 21 22 23 24 25 26 27 t2 EECC756 Shaaban 5 lec 9 Spring2008 4 29 2008 Parallel Performance Metrics Revisited Asymptotic Speedup more processors n than max software DOP m Execution time with one processor Execution time with an infinite number of available processors number of processors n or n m Asymptotic speedup S W T 1 t 1 m i 1 m i i i 1 W T t i m i 1 m i i i 1 m T 1 S T The above ignores all overheads Computing capacity of a single processor m maximum degree of software parallelism ti total time that DOP i Wi total work with DOP i i e Hardware parallelism n exceeds software parallelism m W m i 1 i W i i i 1 Keeping problem size fixed and ignoring parallelization overheads extra work EECC756 Shaaban 6 lec 9 Spring2008 4 29 2008 Phase Parallel Model of An Application Consider a sequential program of size s consisting of k computational phases C 1 Ck where each phase Ci has a degree of parallelism DOP i Assume single processor execution time of phase Ci T1 i i k T T i Total single processor execution time 1 k max DOP 1 i 1 i k T T i min i n Ignoring overheads n processor execution time n i 1 If all overheads are grouped as interaction Tinteract Synch Time Comm Cost and parallelism Tpar Extra Work as h s n Tinteract Tpar then parallel i k n number of processors execution time T T i min i n h s n n 1 i 1 1 Total overheads s problem size If k n and fi is the fraction of sequential execution time with DOP i fi i 1 2 n and ignoring overheads h s n 0 the speedup is 1 given by 1 S n S T n T n i 1 fi i 1 2 n for max DOP n is parallelism degree probability distribution DOP profile f i i EECC756 Shaaban 7 lec 9 Spring2008 4 29 2008 Harmonic Mean Speedup for n Execution Mode Multiprocessor system Fig 3 2 page 111 See handout EECC756 Shaaban 8 lec 9 Spring2008 4 29 2008 Parallel Performance Metrics Revisited Amdahl s Law Harmonic Mean Speedup i number of processors used f i is the fraction of sequential execution time with DOP i S n T 1 Tn 1 n i 1 f i i DOP 1 sequential DOP n In the case fi for i 1 2 n 0 0 1 the system is running sequential code with probability and utilizing n processors with probability 1 with other processor modes not utilized Amdahl s Law 1 S n 1 n 1 as n Keeping problem size fixed and ignoring overheads i e h s n 0 S Under these conditions the best speedup
View Full Document
Unlocking...