Slide 1OverviewGeneral GuidelinesWhat to doWhat to doWhat not to doWhat not to doTips For debuggingTips For debuggingCompiler InfoCompiler InfoCuda DisassemblerExample kernelExample Assembly (1.0)Example Assembly (1.3)Branching ExampleBranching AssemblyTexture CacheThe texture processor clusterTexture Memory“Binding” a texture (the simple way)Simple Example:Complicated method Part 1Complicated Method Part 2Complicated Method Part 3Profiling COdeCompute Visual ProfilerUser InterfaceProfiler Output ViewSummary TableInstruction ThroughputKernel, Memcopy Table ViewsPlotsSlide 34Summary of Important DatesMidterm Project: Progress Report [What’s Needed…]Final Project RelatedFinal Project RelatedSchedule HighlightsME964High Performance Computing for Engineering Applications“Adding manpower to a late software project makes it later”. Fred Brooks© Dan Negrut, 2011ME964 UW-MadisonCUDA Optimization Tips (Hammad Mazhar)Schedule related issuesMarch 22, 2011OverviewGeneral GuidelinesWhat to do and What not to doDebugging TipsCompilerAssemblyTexture usageUsing the profiler2GENERAL GUIDELINESWhat to do and what not to do3What to doUse fast math operations when possibleWaste a register rather than divide the same value multiple timesWhen multiplying/dividing by powers of two use bitshiftingUnroll loops that have a known size Inline simple (1/2 line) functions4What to doMax # of registers set to 32 by defaultProperties for cuda wizard or build rule–maxrregcount=N Forces compiler to use less or more registersExtra registers spill to local memoryGood: use 32 registers rather than 33More occupancy, usually fasterBad: use 32 registers rather than 60Too much local memory usage5What not to doAvoid double precision math where single precision is satisfactoryAvoid division / modulo operators if possibleAvoid static array declarations, compiler will (almost) always use lmemUsed shared memory if possible6What not to doAvoid Inlining large pieces of code, will cause local memory to be used unnecessarily.Avoid complex kernels that need many registersKeep kernels simpleSplit complex kernels to reduce register pressure7Tips For debuggingIf card is compute 2.0 use printf on devicecuPrintf might be useful for cards <2.0look in SDK for code and example“Invalidate” code by putting:If(threadIdx.x==-1){ …code here…}Prevents compiler from optimizing away code Move statement until problem found8Tips For debuggingChecking for execution errors:CUDA_SAFE_CALL(…);Will terminate code with reference to line of codeMeans that something before this call went wrongCUT_CHECK_ERROR(“ERROR MESSAGE”);Prints out user specified string if something went wrong. 9Compiler InfoCompiler is smart about optimizing codeTakes care of register reuseCombining math operationsFused multiply add (MAD)Delay global memory access until variable is actually usedRemove unused codeIf a variable is computed but never used it gets removed at compile time10Compiler InfoCompiler is not perfectReorganizing complex code manually can helpUse --ptxas-options=-v for extra infoShows info at compile time:Compiling entry function '_Z8kernel_exPi' for 'sm_13'Used 16 registers, 4 bytes lmem, 4+16 bytes smem, 4 bytes cmem[1]Useful when optimizing register usagedon’t need to run code to see changes11Cuda DisassemblerLook at what the compiler actually doesAssembly code is a bit tricky but can be followedcuobjdump.exe –dump-sass prog.exe >out.txtWrite assembly to out.txtUseful for making sure that memory reads and writes are optimized, fast math functions are used etc.12Example kernelLoad 4 integers in single 128 bit (16 byte) loadDo some math in a loopStore 4 integers in single 128 bit write__global__ void kernel (int4* A, int reps){uint index=blockIdx.x*blockDim.x+threadIdx.x;for(int i=0; i<reps; i++){int4 temp=A[index];temp.x=temp.y*temp.z*temp.w;A[index]=temp;}}13Example Assembly (1.0)Function : _Z8kernelP4int4i/*0000*/ ISET.S32.C0 o [0x7f], g [0x5], R124, LE;/*0008*/ RET C0.NE;/*0010*/ MOV.U16 R0H, g [0x1].U16;/*0018*/ I2I.U32.U16 R1, R0L;/*0020*/ IMAD.U16 R0, g [0x6].U16, R0H, R1;/*0028*/ SHL R0, R0, 0x4;/*0030*/ IADD R5, g [0x4], R0;/*0038*/ IADD32I R0, R5, 0xc;/*0040*/ GLD.U32 R4, global14 [R0];/*0048*/ MOV R6, R124;/*0050*/ GLD.S128 R0, global14 [R5]; /*0058*/ IMUL32.U16.U16 R3, R0L, R1H;/*005c*/ IMUL32.U16.U16 R7, R4L, R2H;/*0060*/ IMAD.U16 R3, R0H, R1L, R3;/*0068*/ IMAD.U16 R7, R4H, R2L, R7;/*0070*/ SHL R3, R3, 0x10;/*0078*/ SHL R7, R7, 0x10;/*0080*/ IMAD.U16 R0, R0L, R1L, R3;/*0088*/ IMAD.U16 R3, R4L, R2L, R7;/*0090*/ IMUL.U16.U16 R7, R0L, R3H;/*0098*/ IMAD.U16 R7, R0H, R3L, R7;/*00a0*/ SHL R7, R7, 0x10;/*00a8*/ IADD32I R6, R6, 0x1;/*00b0*/ IMAD.U16 R0, R0L, R3L, R7;/*00b8*/ MOV R3, R4;/*00c0*/ ISET.S32.C0 o [0x7f], g [0x5], R6, NE;/*00c8*/ GST.S128 global14 [R5], R0;/*00d0*/ BRA C0.NE, 0x50;/*00d8*/ NOP;14Example Assembly (1.3)Function : _Z8kernelP4int4i/*0000*/ ISET.S32.C0 o [0x7f], g [0x5], R124, LE;/*0008*/ RET C0.NE;/*0010*/ G2R.U16 R0H, g [0x1].U16;/*0018*/ I2I.U32.U16 R1, R0L;/*0020*/ IMAD.U16 R0, g [0x6].U16, R0H, R1;/*0028*/ SHL R0, R0, 0x4;/*0030*/ IADD R5, g [0x4], R0;/*0038*/ IADD32I R0, R5, 0xc;/*0040*/ GLD.U32 R4, global14 [R0];/*0048*/ MOV.SFU R6, R124;/*0050*/ GLD.S128 R0, global14 [R5];/*0058*/ IMUL32.U16.U16 R3, R0L, R1H;/*005c*/ IMUL32.U16.U16 R7, R4L, R2H;/*0060*/ IMAD.U16 R3, R0H, R1L, R3;/*0068*/ IMAD.U16 R7, R4H, R2L, R7;/*0070*/ SHL R3, R3, 0x10;/*0078*/ SHL R7, R7, 0x10;/*0080*/ IMAD.U16 R0, R0L, R1L, R3;/*0088*/ IMAD.U16 R3, R4L, R2L, R7;/*0090*/ IMUL.U16.U16 R7, R0L, R3H;/*0098*/ IMAD.U16 R7, R0H, R3L, R7;/*00a0*/ SHL R7, R7, 0x10;/*00a8*/ IADD32I R6, R6, 0x1;/*00b0*/ IMAD.U16 R0, R0L, R3L, R7;/*00b8*/ MOV R3, R4;/*00c0*/ ISET.S32.C0 o [0x7f], g [0x5], R6, NE;/*00c8*/ GST.S128 global14 [R5], R0;/*00d0*/ BRA C0.NE, 0x50;/*00d8*/ NOP;15Branching Example__global__ void kernel(int* data){if(threadIdx.x==0){data[threadIdx.x]=1;}else if(threadIdx.x==1){data[threadIdx.x]=2;}}16Branching AssemblyFunction : _Z8kernelPi/*0000*/ I2I.U32.U16.C0 R0, R0L;/*0008*/ BRA C0.NE, 0x38;/*0010*/ SHL R1, R0, 0x2;/*0018*/ MVI R0, 0x1;/*0020*/ IADD R1, g [0x4], R1;/*0028*/ GST.U32 global14 [R1], R0;/*0030*/ RET;/*0038*/ ISET.C0 o [0x7f], R0, c [0x1] [0x0], NE;/*0040*/ RET
View Full Document