Spring 2011 Prof. Hyesoon Kim
• Attainable GFLOP/sec = min {Peak floating-point performance, Peak memory bandwidth × Operational intensity}
[Figure: roofline plot. x-axis: operational intensity (flop : byte), 1/8 to 16; y-axis: Gflop/s, 4 to 1024; platforms: Fermi, C1060, Nehalem x 2, Nehalem; labeled points: (3.3, 86), (3.3, 171), (9.1, 933), (7.2, 1030)]
• N : mean number of tasks in system
• λ : arrival rate
• L : latency
• Mean number of tasks in system = arrival rate × mean response time: $N = \lambda L$
• Q: Memory latency is 500 cycles. A memory request is sent every 5th cycle. How many requests are in the memory system?
• 500 × 1/5 = 100: on average, 100 memory requests are in the system
• Or: a memory request is sent every 5th cycle and 100 memory requests are in flight. What is the memory latency?
• Memory latency is 500 cycles. Each warp (32 threads) generates 1 memory request every 5th instruction.
• How many warps do we need to hide memory latency?
– Assume that we have only one core and as many warps as possible.
– λ = 1/5 (one request every 5 instructions), L = 500 cycles, so N = 100 warps
– With batching, this number is divided by the batch size.
Shebanow's limiter theory
• Principle:
– Little's Law: $N = \lambda L$
– N = "number in flight", λ = arrival rate, L = memory latency
• Arrival rate is the product of:
– Desired execution rate (IPC)
– Density of LOAD instructions (%)
• N = # of threads needed to cover latency L
• Use batching
• Group independent LDs together
• Modified law: $N = \lambda L / B$
– B = batch size
M. Shebanow, NVIDIA'10
• LD, Dep inst, LD, Dep inst, ...
• 1 warp
• 4 warps
• 2 warps, load hoist (batch size 2): How to increase batch size?
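The Little's Law arithmetic on these slides can be checked with a minimal Python sketch. The function names (`in_flight`, `warps_needed`, `latency_from_in_flight`) are hypothetical and not from the lecture, and the rounding up to a whole number of warps is an assumption added here. Exact fractions are used so that 1/5 × 500 does not pick up floating-point error.

```python
from fractions import Fraction
import math


def in_flight(arrival_rate, latency):
    """Little's Law: N = lambda * L (mean number of requests in the system)."""
    return Fraction(arrival_rate) * latency


def warps_needed(arrival_rate, latency, batch_size=1):
    """Modified law with batching: N = lambda * L / B.

    Rounding up to a whole warp is an assumption of this sketch.
    """
    return math.ceil(Fraction(arrival_rate) * latency / batch_size)


def latency_from_in_flight(n_in_flight, arrival_rate):
    """Little's Law solved for latency: L = N / lambda."""
    return Fraction(n_in_flight) / Fraction(arrival_rate)


if __name__ == "__main__":
    rate = Fraction(1, 5)  # one memory request every 5th cycle (or instruction)
    print(in_flight(rate, 500))                    # requests in the memory system
    print(latency_from_in_flight(100, rate))       # latency implied by 100 in flight
    print(warps_needed(rate, 500))                 # warps needed without batching
    print(warps_needed(rate, 500, batch_size=2))   # warps needed with batch size 2
```

With the slide's numbers (λ = 1/5, L = 500), the first three calls reproduce the values 100, 500, and 100 from the text. A batch size of 2 halves the warp count, which is the sense in which batching reduces N.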
[Figure: timelines of compute (C) and memory (M) periods; parameters L = 8, λ = 1/2; $N = \lambda L / B$]
• Performance modeling and analysis for G80 architecture
• Design decisions based on benchmark characteristics
• Xbox 360 optimization techniques (just describe)
• Bring your calculator
• # of questions: ~3
• Clock frequency
– Designing factors
– Circuit technology
– Memory latency, IPC
• Pipeline depth decisions
• Power considerations
• Area considerations
• SFU?
• SIMD unit and SIMD width?
• App1: ILP = 0.5, ILP<ooo> = 0.7, TLP = 4 (100% parallelizable), FP_MUL = 20% of instructions, ILP only FP = 32, cache hit ratio trend: 256KB = 40%, 512KB = 50%, 1M = 80%, 2M = 90%, 20% Mem insts
• App2: ILP = 0.25, ILP<ooo> = 0.5, TLP = 8 (100% parallelizable), FP_MUL = 5%, cache hit ratio trend: 256KB = 30%, 512KB = 30%, 1M = 50%, 2M = 50%, ILP only FP = 16, 20% Mem insts
• Budget: 200 units (area); 1-level cache latency {256K = 5, 512K = 7, 1M = 9, 2M = 20}; mem latency = 100 cycles; pipeline depth = 9 cycles + execution latency (1 for INT)
• ooo-core w/o cache: 2w: 40, 2w: 50; ooo-SMT core w/o cache: 2w: 55
• In-core w/o cache: 1w: 20, 2w: 30; in-core SMT w/o cache: 2w: 35
• Cache size 256KB = 20 units; FMUL (10 latency) 0.2 per 1FP; FMUL (2 latency) 0.5 per 1FP
• Critical path = 4
• ILP = 8/4 = 2

Instruction Type   Frequency   CPI
Integer            40%         1.0
Branch             20%         4.0
Load               20%         2.0
Store              10%         3.0

$\text{CPU time} = \left( \sum_{i=1}^{n} IC_i \times CPI_i \right) \times \text{Clock cycle time}$

Total Insts = 50B, Clock speed = 2 GHz
= (0.4*1.0 + 0.2*4.0 + 0.2*2.0 + 0.1*3.0) * 50
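The CPU-time calculation above can be sketched in Python as a way to check the arithmetic. This is a minimal sketch under stated assumptions: "50B" is read as 50 × 10^9 instructions, clock cycle time is taken as 1 / clock speed, and the instruction mix is used exactly as listed in the table (its frequencies sum to 90%). The names `average_cpi` and `cpu_time_seconds` are hypothetical.

```python
# Instruction mix from the table: type -> (frequency, CPI).
# Frequencies are used as given; they sum to 0.9, not 1.0.
INSTRUCTION_MIX = {
    "Integer": (0.4, 1.0),
    "Branch": (0.2, 4.0),
    "Load": (0.2, 2.0),
    "Store": (0.1, 3.0),
}


def average_cpi(mix):
    """Weighted CPI: sum over instruction types of frequency_i * CPI_i."""
    return sum(freq * cpi for freq, cpi in mix.values())


def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU time = IC * CPI * clock cycle time, with cycle time = 1 / clock_hz."""
    return instruction_count * cpi / clock_hz


if __name__ == "__main__":
    total_insts = 50e9   # assumption: "50B" means 50 billion instructions
    clock_hz = 2e9       # 2 GHz
    cpi = average_cpi(INSTRUCTION_MIX)
    print(f"average CPI = {cpi:.2f}")
    print(f"CPU time    = {cpu_time_seconds(total_insts, cpi, clock_hz):.2f} s")
```

Under these assumptions the weighted CPI is about 1.9 and the CPU time comes out to about 47.5 seconds.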