UNO CSCI 8150 - Principles of Scalable Performance

CSCI 8150 Advanced Computer Architecture
Hwang, Chapter 3: Principles of Scalable Performance
3.1 Performance Metrics and Measures

Slide outline: Degree of Parallelism; Example Parallelism Profile; Average Parallelism (1-3); Available Parallelism; Basic Blocks; Asymptotic Speedup (1-2); Mean Performance Calculation; Arithmetic Mean; Geometric Mean; Harmonic Mean; Weighted Harmonic Mean; Weighted Harmonic Mean Speedup; Amdahl's Law; System Efficiency (1-2); Redundancy; System Utilization; Quality of Parallelism; Standard Industry Performance Measures; Doing What?; What's VAX Got To Do With It?; Other Measures.

Degree of Parallelism
- The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); it can vary over time.
- DOP assumes an unlimited number of processors is available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may also impose limiting conditions.
- A plot of DOP versus time is called a parallelism profile.

Example Parallelism Profile
[Figure: a parallelism profile plotting DOP against time over the interval t_1 to t_2, with the average parallelism shown as a horizontal line.]

Average Parallelism - 1
Assume the following:
- n homogeneous processors; the maximum parallelism in a profile is m. Ideally, n >> m.
- \Delta is the computing capacity of a single processor, expressed in something like MIPS or Mflops, without regard for memory latency, etc.
- i is the number of processors busy during an observation period (i.e., DOP = i).
- W is the total work (instructions or computations) performed by the program.
- A is the average parallelism in the program.

Average Parallelism - 2
The total work is
    W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt
or, in discrete form,
    W = \Delta \sum_{i=1}^{m} i \cdot t_i
where t_i is the total time during which DOP = i, and
    \sum_{i=1}^{m} t_i = t_2 - t_1.

Average Parallelism - 3
The average parallelism is
    A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt
or, in discrete form,
    A = \frac{\sum_{i=1}^{m} i \cdot t_i}{\sum_{i=1}^{m} t_i}.

Available Parallelism
- Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g., hundreds or thousands of instructions per clock cycle).
- In real machines, however, the parallelism actually achieved is much smaller (e.g., 10 or 20).

Basic Blocks
- A basic block is a sequence (block) of instructions with one entry and one exit.
- Basic blocks are frequently the focus of compiler optimizers, since it is easier to manage the use of the registers utilized within a block.
- Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).

Asymptotic Speedup - 1
    W_i = i \Delta t_i                 (work done while DOP = i)
    W = \sum_{i=1}^{m} W_i             (relates the sum of the W_i terms to W)
    t_i(k) = W_i / (k \Delta)          (execution time of W_i with k processors)
    t_i(\infty) = W_i / (i \Delta)     (for 1 \le i \le m)

Asymptotic Speedup - 2
Response time with one processor:
    T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}
Response time with unlimited processors:
    T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i \Delta}
The asymptotic speedup (in the ideal case) is
    S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i} = A.
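Worked example (Average Parallelism and Asymptotic Speedup). A minimal Python sketch of the formulas above; the DOP profile and the processor capacity \Delta are made-up illustrative values, not from the slides.

    # Minimal sketch (made-up numbers): average parallelism and asymptotic speedup
    # computed from a discrete DOP profile, following the formulas above.
    delta = 1.0                                   # computing capacity of one processor
    profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}    # t_i: time spent at DOP = i (seconds)

    work = {i: i * delta * t for i, t in profile.items()}               # W_i = i * delta * t_i
    W = sum(work.values())                                              # total work
    A = sum(i * t for i, t in profile.items()) / sum(profile.values())  # average parallelism

    T1 = sum(w / delta for w in work.values())              # T(1): one processor
    T_inf = sum(w / (i * delta) for i, w in work.items())   # T(inf): unlimited processors
    S_inf = T1 / T_inf                                      # asymptotic speedup, equals A

    print(W, round(A, 2), round(S_inf, 2))

With these numbers, A = S_inf = 2.6, illustrating that the asymptotic speedup equals the average parallelism in the ideal case.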
Mean Performance Calculation
- We seek a measure that characterizes the mean (average) performance of a set of benchmark programs with potentially many different execution modes (e.g., scalar, vector, sequential, parallel).
- We may also wish to associate weights with these programs, to emphasize particular modes and yield a more meaningful performance measure.

Arithmetic Mean
- The arithmetic mean is familiar: the sum of the terms divided by the number of terms.
- Our measures use execution rates expressed in MIPS or Mflops.
- The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it is not inversely proportional to the sum of the execution times.
- Thus the arithmetic mean fails to represent the real time consumed by the benchmarks when they are executed.

Geometric Mean
- The geometric mean of n terms is the n-th root of the product of the n terms.
- Like the arithmetic mean, the geometric mean of a set of execution rates does not have an inverse relationship with the total execution time of the programs.
- (The geometric mean has been advocated for use with performance numbers normalized against a reference machine.)

Harmonic Mean
Instead of the arithmetic or geometric mean, we use the harmonic mean execution rate, which is simply the inverse of the arithmetic mean of the execution times (thus guaranteeing the inverse relation not exhibited by the other means):
    R_h = \frac{m}{\sum_{i=1}^{m} 1/R_i}

Weighted Harmonic Mean
If we associate weights f_i with the benchmarks, we can compute the weighted harmonic mean:
    R_h^* = \frac{1}{\sum_{i=1}^{m} f_i / R_i}

Weighted Harmonic Mean Speedup
- T_1 = 1/R_1 = 1 is the sequential execution time on a single processor with rate R_1 = 1.
- T_i = 1/R_i = 1/i is the execution time using i processors with a combined execution rate of R_i = i.
- Now suppose a program has n execution modes with associated weights f_1, ..., f_n. The weighted harmonic mean speedup is defined as
    S = T_1 / T^* = \frac{1}{\sum_{i=1}^{n} f_i / R_i}
  where T^* = 1/R_h^* is the weighted arithmetic mean execution time.

Amdahl's Law
- Assume R_i = i, and that the weights are w = (\alpha, 0, \ldots, 0, 1 - \alpha).
- Essentially, the system is used either sequentially (with probability \alpha) or with all n processors (with probability 1 - \alpha).
- This yields the speedup equation known as Amdahl's law:
    S_n = \frac{n}{1 + (n - 1)\alpha}
- The implication is that the best speedup possible is 1/\alpha, regardless of n, the number of processors.

System Efficiency - 1
Assume the following definitions:
- O(n) = total number of "unit operations" performed by an n-processor system in completing a program P.
- T(n) = execution time required to execute program P on an n-processor system.
- O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor.
- If we define O(1) = T(1), then it is logical to expect T(n) < O(n) when n > 1, provided the program P can make any use at all of the extra processor(s).

System Efficiency - 2
- The speedup factor (how much faster the program runs with n processors) can now be expressed as
    S(n) = T(1) / T(n)
- Recall that we expect T(n) < T(1), so S(n) \ge 1.
- System efficiency is defined as
    E(n) = S(n) / n = T(1) / (n \cdot T(n))
- It indicates the actual degree of speedup achieved in a system compared with the maximum possible speedup. Thus 1/n \le E(n) \le 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.

Redundancy
The redundancy in a parallel computation is defined as
    R(n) = O(n) / O(1)
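Worked example (Arithmetic, Geometric, Harmonic, and Weighted Harmonic Means). A small Python sketch of the mean performance measures defined above; the execution rates, benchmark weights, and mode weights are made-up examples, not from the slides.

    # Minimal sketch (made-up numbers): means of benchmark execution rates,
    # plus the weighted harmonic mean speedup for a multi-mode program.
    import math

    rates = [10.0, 40.0, 80.0]     # R_i, e.g., in Mflops
    weights = [0.5, 0.3, 0.2]      # f_i, must sum to 1

    m = len(rates)
    arithmetic = sum(rates) / m
    geometric = math.prod(rates) ** (1 / m)
    harmonic = m / sum(1 / r for r in rates)                       # R_h
    weighted_harmonic = 1 / sum(f / r for f, r in zip(weights, rates))  # R_h*

    print(arithmetic, geometric, harmonic, weighted_harmonic)

    # Weighted harmonic mean speedup: with R_i = i (mode i uses i processors)
    # and mode weights f_i, S = 1 / sum(f_i / R_i).
    mode_weights = [0.4, 0.3, 0.2, 0.1]   # f_1..f_4 for a 4-mode program (made up)
    S = 1 / sum(f / i for i, f in enumerate(mode_weights, start=1))
    print(S)

Note that harmonic <= geometric <= arithmetic for any set of positive rates; only the (weighted) harmonic mean rate is inversely proportional to the total execution time.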

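Worked example (Amdahl's Law). A numerical illustration of the formula S_n = n / (1 + (n - 1)\alpha) above; the sequential fraction \alpha and the processor counts are assumed values chosen for illustration.

    # Minimal sketch of Amdahl's law: S_n = n / (1 + (n - 1) * alpha).
    # alpha (the sequential fraction) and the processor counts are made-up values.
    alpha = 0.05

    def amdahl_speedup(n: int, alpha: float) -> float:
        """Speedup with n processors when a fraction alpha of the work is sequential."""
        return n / (1 + (n - 1) * alpha)

    for n in (1, 2, 8, 64, 1024):
        print(n, round(amdahl_speedup(n, alpha), 2))
    # As n grows, the speedup saturates near 1 / alpha = 20, regardless of n.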

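Worked example (Speedup, Efficiency, and Redundancy). A small Python sketch applying the S(n), E(n), and R(n) definitions above to hypothetical measurements of T(n) and O(n); all numbers are invented for illustration.

    # Minimal sketch (made-up measurements): speedup S(n), efficiency E(n),
    # and redundancy R(n) for a hypothetical program on n processors.
    T = {1: 100.0, 4: 30.0, 16: 10.0}     # T(n): execution time with n processors
    O = {1: 100.0, 4: 110.0, 16: 130.0}   # O(n): total unit operations (O(1) = T(1))

    for n in (4, 16):
        S = T[1] / T[n]     # speedup
        E = S / n           # efficiency, 1/n <= E(n) <= 1
        R = O[n] / O[1]     # redundancy: extra operations incurred by parallel execution
        print(n, round(S, 2), round(E, 2), round(R, 2))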