Performance Technology Seth Goldstein CS 740 Course Goals Understanding computers Not a course on processor design Improve Use of computers Getting the most from the system Improving system performance How to build system Sept 3 2003 Ability to predict how computers will affect you Topics Understand structure of computer Understand trends that influence computer performance Course Info Performance measures Relating performance measures Memory Technology SRAM DRAM Disk Technology Understand abstraction and what lie behind them Based on slides by TCM Structure of Course CS 740 F 03 Performance expressed as a time Lectures Textbooks Schedule Homework Papers Participation Projects Absolute time measures difference between start and finish of an operation synonyms running time elapsed time response time latency completion time execution time most straightforward performance measure Relative normalized time measures running time normalized to some reference time e g time reference time Course has roughly 2 parts Guiding principle Choose performance measures that track running time Uniprocessors Multiprocessors Based on slides by TCM 2 3 CS 740 F 03 Based on slides by TCM 4 CS 740 F 03 Performance expressed as a rate Performance expressed as a rate cont Rates are performance measures expressed in units of work per unit time Key idea Report rates that track execution time Examples Example Suppose we are measuring a program that convolves a stream of images from a video camera millions of instructions sec MIPS millions of floating point instructions sec MFLOPS millions of bytes sec MBytes sec millions of bits sec Mbits sec images sec samples sec transactions sec TPS Bad performance measure MFLOPS number of floating point operations depends on the particular convolution algorithm n 2 matix vector product vs nlogn fast Fourier transform An FFT with a bad MFLOPS rate may run faster than a matrix vector product with a good MFLOPS rate Good performance measure images sec a program that runs faster will convolve more images per second 5 Based on slides by TCM CS 740 F 03 Performance expressed as a rate cont Example the i860 is advertised as having a peak rate of 80 MFLOPS 40 MHz with 2 flops per cycle However the measured performance of some compiled linear algebra kernels icc O2 tells a different story Based on slides by TCM 1d fft sasum 8 5 3 2 11 4 saxpy 6 1 7 7 sdot 10 3 13 6 CS 740 F 03 Relating time to system measures Suppose that for some program we have Fallacy Peak rates track running time Kernel MFLOPS peak Based on slides by TCM sgemm sgemv 6 2 15 0 8 19 spvma 8 1 10 CS 740 F 03 T seconds running time the ultimate performance measure C clock ticks I instructions P seconds tick performance measures of interest to the system designer T secs C ticks x P secs tick I inst I inst x C ticks x P secs tick T secs I inst x C ticks I inst x P secs tick running time instruction count Based on slides by TCM avg clock ticks per instruction CPI 8 clock period CS 740 F 03 Pipeline latency and throughput Video system performance Stage 1 In I3 I2 I1 video processing system N input images On O3 O2 O1 N output images L 3 secs image time R 1 L 1 3 images sec T L N 1 1 R 3N Latency L time to process an individual image Throughput R images processed per unit time 1 1 2 1 3 1 4 2 5 2 6 2 7 3 1 out 2 out One image can be processed by the system at any point in time 9 Based on slides by TCM CS 740 F 03 Pipelining the video system 10 Based on slides by TCM Pipelined video system performance video pipeline In I3 I2 I1 stage 1 buffer N input images L1 R1 stage 2 CPU stage 3 display L2 R2 L3 R3 On O3 O2 O1 N output images Based on slides by TCM L1 L 2 L3 1 One image can be in each stage at any point in time L 3 secs image Li latency of stage i Ri throughput of stage i R 1 image sec 11 Stage 1 Suppose Then T L N 1 1 R N 2 L L1 L2 L3 R min R1 R2 R3 CS 740 F 03 CS 740 F 03 Based on slides by TCM time Stage 2 Stage 3 1 1 2 2 1 3 3 2 1 4 4 3 2 1 out 5 5 4 3 2 out 6 6 5 4 3 out 7 7 6 5 4 out 12 CS 740 F 03 Relating time to latency throughput In general T L N 1 R The impact of latency and throughput on running time depends on N Amdahl s law You plan to visit a friend in Normandy France and must decide whether it is worth it to take the Concorde SST 3 100 or a 747 1 021 from NY to Paris assuming it will take 4 hours Pgh to NY and 4 hours Paris to Normandy N 1 T L N 1 T N R time NY Paris total trip time speedup over 747 To maximize throughput we should try to maximize the minimum throughput over all stages i e we strive for all stages to have equal throughput 13 Based on slides by TCM CS 740 F 03 747 SST 8 5 hours 3 75 hours CS 740 F 03 Amdahl s law cont T1 time that can NOT be enhanced T2 Old time T T1 T2 T2 time that can be enhanced New program enhanced T2 time after the enhancement T2 T2 New time T T1 T2 Fenhanced T2 T Senhanced T2 T2 fraction of original time that can be improved speedup of enhanced part T T1 T2 T1 T2 T 1 Fenhanced T2 T 1 Fenhanced T2 Senhanced T 1 Fenhanced T Fenhanced Senhanced T 1 Fenhanced Fenhanced Senhanced by def of Senhanced by def of Fenhanced Amdahl s Law Soverall T T 1 1 Fenhanced Fenhanced Senhanced Speedup Soverall T T Based on slides by TCM 14 Based on slides by TCM Two key parameters Old program unenhanced T1 T1 1 1 4 Taking the SST which is 2 2 times faster speeds up the overall trip by only a factor of 1 4 Amdahl s law cont T1 16 5 hours 11 75 hours Key idea Amdahl s law quantifies the general notion of diminishing returns It applies to any activity not just computer programs 15 CS 740 F 03 Based on slides by TCM 16 CS 740 F 03 Amdahl s law cont Amdahl s law cont Trip example Suppose that for the New York to Paris leg we now consider the possibility of taking a rocket ship 15 minutes or a handy rip in the fabric of space time 0 minutes time NY Paris 747 8 5 hours SST 3 75 hours rocket 0 25 hours rip 0 0 hours total trip time 16 5 hours 11 75 hours 8 25 hours 8 hours speedup over 747 1 1 4 …
View Full Document
Unlocking...