Page 1Performance & TechnologyTodd C. MowryCS 740Sept 11, 2007Topics:• Performance measures• Relating performance measures• Memory Technology–SRAM, DRAM• Disk Technology• Recent Processor TrendsCS 740 F’072Performance expressed as a timeAbsolute time measures• difference between start and finish of an operation• synonyms: running time, elapsed time, response time, latency, completion time, execution time• most straightforward performance measureRelative (normalized) time measures• running time normalized to some reference time • (e.g. time/reference time)Guiding principle: Choose performance measures that track running time.CS 740 F’073Performance expressed as a rateRates are performance measures expressed in units of work per unit time.Examples:• millions of instructions / sec (MIPS)• millions of floating point instructions / sec (MFLOPS)• millions of bytes / sec (MBytes/sec)• millions of bits / sec (Mbits/sec)• images / sec• samples / sec• transactions / sec (TPS)CS 740 F’074Performance expressed as a rate(cont)Key idea: Report rates that track execution time.Example: Suppose we are measuring a program that convolves a stream of images from a video camera.Bad performance measure: MFLOPS• number of floating point operations depends on the particular convolution algorithm: n^2 matix-vector product vs nlogn fast Fourier transform. An FFT with a bad MFLOPS rate may run faster than a matrix-vector product with a good MFLOPS rate.Good performance measure: images/sec• a program that runs faster will convolve more images per second.Page 2CS 740 F’075Performance expressed as a rate(cont)Fallacy: Peak rates track running time.Example: the i860 is advertised as having a peak rate of 80 MFLOPS (40 MHz with 2 flops per cycle). However, the measured performance of some compiled linear algebra kernels (icc -O2) tells a different story:Kernel 1d fft sasum saxpy sdot sgemm sgemv spvmaMFLOPS 8.5 3.2 6.1 10.3 6.2 15.0 8.1%peak 11% 4% 7% 13% 8% 19% 10%CS 740 F’076Relating time to system measuresSuppose that for some program we have: • T seconds running time (the ultimate performance measure)• C clock ticks, I instructions, P seconds/tick (performance measures of interest to the system designer)T secs = C ticks x P secs/tick= (I inst/I inst) x C ticks x P secs/tickT secs = I inst x (C ticks/I inst) x P secs/tickrunningtimeinstructioncountavg clockticks perinstruction(CPI)clock periodCS 740 F’077Pipeline latency and throughputvideo processing system(N input images)In,...,I3, I2, I1(N output images)On,...,O3, O2, O1Latency (L): time to process an individual image.Throughput (R): images processed per unit timeOne image can be processed by the system at any point in timeCS 740 F’078Video system performanceL = 3 secs/image.R = 1/L = 1/3 images/sec.T = L + (N-1)1/R= 3Ntime1Stage 123456711122231 out2 outPage 3CS 740 F’079Pipelining the video systemstage 1(buffer)video pipeline(L1,R1)(L3,R3)(L2,R2)stage 3(display)stage 2(CPU)(N input images)In,...,I3, I2, I1(N output images)On,...,O3, O2, O1One image can be in each stage at any point in time. Li= latency of stage iRi= throughput of stage iL = L1+ L2+ L3R = min(R1, R2, R3)CS 740 F’0710Pipelined video system performancetime1Suppose:L1= L2= L3= 1Then:L = 3 secs/image.R = 1 image/sec.T = L + (N-1)1/R= N + 2Stage 1 Stage 2 Stage 32345671234567123456123451 out2 out3 out4 outCS 740 F’0711Relating time to latency & throughputIn general:• T = L + (N-1)/RThe impact of latency and throughput on running time depends on N:• (N = 1) => (T = L)• (N >> 1) => (T = N/R)To maximize throughput, we should try to maximize the minimum throughput over all stages (i.e., we strive for all stages to have equal throughput).CS 740 F’0712Amdahl’s lawYou plan to visit a friend in Normandy France and must decide whether it is worth it to take the Concorde SST ($3,100) or a 747 ($1,021) from NY to Paris, assuming it will take 4 hours Pgh to NY and 4 hours Paris to Normandy.time NY->Paris total trip time speedup over 747747 8.5 hours 16.5 hours 1SST 3.75 hours 11.75 hours 1.4Taking the SST (which is 2.2 times faster) speeds up the overall trip by only a factor of 1.4!Page 4CS 740 F’0713Amdahl’s law (cont)T1T2Old program (unenhanced)T1= time that can NOTbe enhanced.T2= time that can beenhanced.T2’ = time after theenhancement. Old time: T = T1+ T2T1’= T1T2’ <= T2New program (enhanced)New time: T’ = T1’+ T2’Speedup: Soverall= T / T’CS 740 F’0714Amdahl’s law (cont)Two key parameters: Fenhanced= T2/ T (fraction of original time that can be improved)Senhanced= T2/ T2’ (speedup of enhanced part)T’ = T1’+ T2’= T1+ T2’ = T(1-Fenhanced) + T2’= T(1-Fenhanced) + (T2/Senhanced) [by def of Senhanced]= T(1-Fenhanced) + T(Fenhanced/Senhanced) [by def of Fenhanced]= T((1-Fenhanced) + Fenhanced/Senhanced)Amdahl’s Law:Soverall= T / T’ = 1/((1-Fenhanced) + Fenhanced/Senhanced)Key idea: Amdahl’s law quantifies the general notion of diminishing returns. It applies to any activity, not just computer programs.CS 740 F’0715Amdahl’s law (cont)Trip example: Suppose that for the New York to Paris leg, we now consider the possibility of taking a rocket ship (15 minutes) or a handy rip in the fabric of space-time (0 minutes):time NY->Paris total trip time speedup over 747747 8.5 hours 16.5 hours 1SST 3.75 hours 11.75 hours 1.4rocket 0.25 hours 8.25 hours 2.0rip 0.0 hours 8 hours 2.1CS 740 F’0716Amdahl’s law (cont)Useful corollary to Amdahl’s law:• 1 <= Soverall<= 1 / (1 - Fenhanced)FenhancedMax SoverallFenhancedMax Soverall0.0 1 0.9375 160.5 2 0.96875 320.75 4 0.984375 640.875 8 0.9921875 128Moral: It is hard to speed up a program.Moral++ : It is easy to make premature optimizations.Page 5CS 740 F’0717Computer SystemDiskDiskMemory-I/O busMemory-I/O busProcessorProcessorCacheCacheMemoryMemoryI/OcontrollerI/OcontrollerI/OcontrollerI/OcontrollerI/OcontrollerI/OcontrollerDisplayDisplayNetworkNetworkRegCS 740 F’0718Levels in Memory HierarchyCPUCPUregsregsCacheMemoryMemorydiskdisksize:speed:$/Mbyte:block size:200 B3 ns8 BRegister Cache Memory Disk Memory32 KB / 4MB6 ns$100/MB32 B128 MB60 ns$1.50/MB8 KB20 GB8 ms$0.05/MBlarger, slower, cheaper8 B 32 B 8 KBcache virtual memoryCS 740 F’0719Scaling to 0.1µm• Semiconductor Industry Association, 1992 Technology Workshop– Projected future technology based on past trends19921995 1998 2001 2004 2007Feature
View Full Document