Spring 2011
Prof. Hyesoon Kim

• Floating-point multiply-and-add operation:
  – 2 FP operations
• Please look at the PTX instructions
• You might not get what the device query says: explain why…
• objdump will provide more precise results, but for this assignment just use PTX.
• Arithmetic intensity: math operations per memory op = (sum of FP operations) / (sum of transferred bytes)

• Register read is fully pipelined.
• Back-to-back (dependent) operations are on the critical path
• ILP across warps (~= TLP) can hide the latency of back-to-back operations

R1 = R2 + R3
R4 = R1 + R4
(the same instruction pair repeats)

[Figure: with 1 warp, there is a 24-cycle delay between the 2 instructions; with warps w0, w1, …, wN interleaved, the 24-cycle delay is hidden by TLP]

loop { a = a + c; }    dependent instructions across loop iterations

• Any performance difference?
• DRAM row buffer hits and misses will make a big difference

for (ii = 0; ii < 2000; ++ii) {
    ref = base + (16*ii) + tx;
    sh_ref = base + (16*ii) + tx;
    temp[sh_ref] = dm[ref];
}

for (ii = 0; ii < 2000; ++ii) {
    ref = base + tx;
    sh_ref = base + tx;
    temp[sh_ref] = dm[ref];
}

• Coalescing
[Figure: threads t0, t1, t2, t3, …, t14, t15 map to consecutive addresses 128, 132, 136, 140, 144, …, 184, 188, 192; all threads participate]

• Uncoalescing (Braid's lab)
[Figure: threads t0, t1, t2, t3, …, t14, t15 over addresses 128, 132, 136, 140, 144, …, 184, 188, 192; all threads participate; vary the starting distance]
• Mem addr = (tid)*X + Y + ii (ii = loop iteration)
• Vary X and Y to generate different access patterns

[Figure: SRAM cell (wordline, bitlines b and b̄) versus DRAM cell (wordline, bitline b)]
[Figure: DRAM organization: row address, row decoder, memory cell array, sense amps, row buffer, column address, column decoder, data bus]
[Figure: bitline voltage and storage cell voltage between 0 and Vdd over time, as the wordline and then the sense amp are enabled]

• After a read of 0 or 1, the cell contains something close to 1/2
• DRAM refresh is necessary to keep the data as well

• Row buffer hit and miss penalty
  – Miss: CAS + RAS + precharge
  – Hit: CAS
• Bank conflicts
• DRAM access time varies 10x

• Lab #2: 7% 10%.
• Friday 6 pm: Extra 10%.
• Extended due: Monday 6 pm
• One more poll for make-up class.
• Newsgroup participation will provide bonus points
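
The access-pattern bullet above (Mem addr = (tid)*X + Y + ii) can be explored on the host before writing the kernel. The following is a minimal sketch, not the lab's reference code: the function names `mem_index` and `segments_touched` are hypothetical, and the 4-byte element size, 64-byte aligned segment size, and array base at byte address 0 are assumptions made for illustration. Real coalescing rules depend on the GPU's compute capability. The sketch counts how many aligned segments one group of threads touches in a single loop iteration; fewer segments suggests a better-coalesced pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Element index touched by thread `tid` at loop iteration `ii`,
   following the slide's formula: tid*X + Y + ii. */
static size_t mem_index(size_t tid, size_t x, size_t y, size_t ii)
{
    return tid * x + y + ii;
}

/* Count the distinct aligned segments of `seg_bytes` bytes touched by
   threads 0..n_threads-1 in one loop iteration.
   Assumptions (not from the slides): elements are `elem_bytes` wide,
   `seg_bytes` is a multiple of `elem_bytes`, and the array starts at
   byte address 0. */
static size_t segments_touched(size_t n_threads, size_t x, size_t y,
                               size_t ii, size_t elem_bytes,
                               size_t seg_bytes)
{
    assert(elem_bytes > 0 && seg_bytes > 0);

    size_t count = 0;
    size_t last_seg = 0;

    /* Addresses never decrease as tid grows (x is unsigned), so segment
       numbers never decrease either; counting the changes therefore
       counts the distinct segments. */
    for (size_t tid = 0; tid < n_threads; ++tid) {
        size_t seg = mem_index(tid, x, y, ii) * elem_bytes / seg_bytes;
        if (tid == 0 || seg != last_seg) {
            ++count;
            last_seg = seg;
        }
    }
    return count;
}
```

Under these assumptions, 16 threads with X = 1 and Y = 32 cover byte addresses 128 through 188, as in the coalescing figure, and touch a single segment. Shifting the start by one element (Y = 33) spills into a second segment, and X = 16 places every thread in its own segment.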