Performance Technology Seth Goldstein CS 740 Sept 11 2002 Topics Course Info Performance measures Relating performance measures Memory Technology SRAM DRAM Disk Technology Course Goals Understanding computers Not a course on processor design Improve Use of computers Getting the most from the system Improving system performance How to build system Ability to predict how computers will affect you Understand structure of computer Understand trends that influence computer performance Understand abstraction and what lie behind them Based on slides by TCM 2 Page 1 CS 740 F 02 Structure of Course Lectures Schedule Homeworks Papers Participation Projects Course has roughly 2 parts Uniprocessors Multiprocessors Based on slides by TCM 3 CS 740 F 02 Performance expressed as a time Absolute time measures difference between start and finish of an operation synonyms running time elapsed time response time latency completion time execution time most straightforward performance measure Relative normalized time measures running time normalized to some reference time e g time reference time Guiding principle Choose performance measures that track running time Based on slides by TCM 4 Page 2 CS 740 F 02 Performance expressed as a rate Rates are performance measures expressed in units of work per unit time Examples millions of instructions sec MIPS millions of floating point instructions sec MFLOPS millions of bytes sec MBytes sec millions of bits sec Mbits sec images sec samples sec transactions sec TPS Based on slides by TCM 5 CS 740 F 02 Performance expressed as a rate cont Key idea Report rates that track execution time Example Suppose we are measuring a program that convolves a stream of images from a video camera Bad performance measure MFLOPS number of floating point operations depends on the particular convolution algorithm n 2 matix vector product vs nlogn fast Fourier transform An FFT with a bad MFLOPS rate may run faster than a matrix vector product with a good MFLOPS rate Good performance measure images sec a program that runs faster will convolve more images per second Based on slides by TCM 6 Page 3 CS 740 F 02 Performance expressed as a rate cont Fallacy Peak rates track running time Example the i860 is advertised as having a peak rate of 80 MFLOPS 40 MHz with 2 flops per cycle However the measured performance of some compiled linear algebra kernels icc O2 tells a different story Kernel MFLOPS peak 1d fft sasum 8 5 3 2 11 4 Based on slides by TCM saxpy 6 1 7 sdot 10 3 13 sgemm sgemv 6 2 15 0 8 19 7 spvma 8 1 10 CS 740 F 02 Relating time to system measures Suppose that for some program we have T seconds running time the ultimate performance measure C clock ticks I instructions P seconds tick performance measures of interest to the system designer T secs C ticks x P secs tick I inst I inst x C ticks x P secs tick T secs I inst x C ticks I inst x P secs tick running time instruction count Based on slides by TCM avg clock ticks per instruction CPI 8 Page 4 clock period CS 740 F 02 Pipeline latency and throughput In I3 I2 I1 On O3 O2 O1 video processing system N input images N output images Latency L time to process an individual image Throughput R images processed per unit time One image can be processed by the system at any point in time Based on slides by TCM 9 CS 740 F 02 Video system performance Stage 1 L 3 secs image time R 1 L 1 3 images sec T L N 1 1 R 3N Based on slides by TCM 10 Page 5 1 1 2 1 3 1 4 2 5 2 6 2 7 3 1 out 2 out CS 740 F 02 Pipelining the video system video pipeline In I3 I2 I1 stage 1 buffer N input images L1 R1 stage 2 CPU stage 3 display L2 R2 L3 R3 On O3 O2 O1 N output images One image can be in each stage at any point in time Li latency of stage i Ri throughput of stage i L L1 L2 L3 R min R1 R2 R3 11 Based on slides by TCM CS 740 F 02 Pipelined video system performance Stage 1 Suppose Stage 2 Stage 3 L1 L2 L3 1 1 1 2 2 1 Then 3 3 2 1 4 4 3 2 1 out 5 5 4 3 2 out 6 6 5 4 3 out 7 7 6 5 4 out L 3 secs image R 1 image sec T L N 1 1 R N 2 Based on slides by TCM time 12 Page 6 CS 740 F 02 Relating time to latency throughput In general T L N 1 R The impact of latency and throughput on running time depends on N N 1 T L N 1 T N R To maximize throughput we should try to maximize the minimum throughput over all stages i e we strive for all stages to have equal throughput Based on slides by TCM 13 CS 740 F 02 Amdahl s law You plan to visit a friend in Normandy France and must decide whether it is worth it to take the Concorde SST 3 100 or a 747 1 021 from NY to Paris assuming it will take 4 hours Pgh to NY and 4 hours Paris to Normandy time NY Paris total trip time speedup over 747 747 SST 8 5 hours 3 75 hours 16 5 hours 11 75 hours 1 1 4 Taking the SST which is 2 2 times faster speeds up the overall trip by only a factor of 1 4 Based on slides by TCM 14 Page 7 CS 740 F 02 Amdahl s law cont Old program unenhanced T1 T1 time that can NOT be enhanced T2 Old time T T1 T2 T2 time that can be enhanced New program enhanced T1 T1 T2 time after the enhancement T2 T2 New time T T1 T2 Speedup Soverall T T 15 Based on slides by TCM CS 740 F 02 Amdahl s law cont Two key parameters Fenhanced T2 T Senhanced T2 T2 fraction of original time that can be improved speedup of enhanced part T T1 T2 T1 T2 T 1 Fenhanced T2 T 1 Fenhanced T2 Senhanced T 1 Fenhanced T Fenhanced Senhanced T 1 Fenhanced Fenhanced Senhanced by def of Senhanced by def of Fenhanced Amdahl s Law Soverall T T 1 1 Fenhanced Fenhanced Senhanced Key idea Amdahl s law quantifies the general notion of diminishing returns It applies to any activity not just computer programs Based on slides by TCM 16 Page 8 CS 740 F 02 Amdahl s law cont Trip example Suppose that for the New York to Paris leg we now consider the possibility of taking a rocket ship 15 minutes or a handy rip in the fabric of space time 0 minutes time NY Paris 747 8 5 hours SST 3 75 hours rocket 0 25 hours rip 0 0 hours total trip time 16 5 hours 11 75 hours 8 …
View Full Document
Unlocking...