UW-Madison ME 964 - CUDA Optimization Tips - D2259364

Home> Schools> University of Wisconsin, Madison> Mechanical Engineering (ME) > ME 964> CUDA Optimization Tips

DOC PREVIEW

UW-Madison ME 964 - CUDA Optimization Tips

School name University of Wisconsin, Madison

Course Me 964- High Performance Computing for Engineering Applications

Pages 39

This preview shows page 1-2-3-18-19-37-38-39 out of 39 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1OverviewGeneral GuidelinesWhat to doWhat to doWhat not to doWhat not to doTips For debuggingTips For debuggingCompiler InfoCompiler InfoCuda DisassemblerExample kernelExample Assembly (1.0)Example Assembly (1.3)Branching ExampleBranching AssemblyTexture CacheThe texture processor clusterTexture Memory“Binding” a texture (the simple way)Simple Example:Complicated method Part 1Complicated Method Part 2Complicated Method Part 3Profiling COdeCompute Visual ProfilerUser InterfaceProfiler Output ViewSummary TableInstruction ThroughputKernel, Memcopy Table ViewsPlotsSlide 34Summary of Important DatesMidterm Project: Progress Report [What’s Needed…]Final Project RelatedFinal Project RelatedSchedule HighlightsME964High Performance Computing for Engineering Applications“Adding manpower to a late software project makes it later”. Fred Brooks© Dan Negrut, 2011ME964 UW-MadisonCUDA Optimization Tips (Hammad Mazhar)Schedule related issuesMarch 22, 2011OverviewGeneral GuidelinesWhat to do and What not to doDebugging TipsCompilerAssemblyTexture usageUsing the profiler2GENERAL GUIDELINESWhat to do and what not to do3What to doUse fast math operations when possibleWaste a register rather than divide the same value multiple timesWhen multiplying/dividing by powers of two use bitshiftingUnroll loops that have a known size Inline simple (1/2 line) functions4What to doMax # of registers set to 32 by defaultProperties for cuda wizard or build rule–maxrregcount=N Forces compiler to use less or more registersExtra registers spill to local memoryGood: use 32 registers rather than 33More occupancy, usually fasterBad: use 32 registers rather than 60Too much local memory usage5What not to doAvoid double precision math where single precision is satisfactoryAvoid division / modulo operators if possibleAvoid static array declarations, compiler will (almost) always use lmemUsed shared memory if possible6What not to doAvoid Inlining large pieces of code, will cause local memory to be used unnecessarily.Avoid complex kernels that need many registersKeep kernels simpleSplit complex kernels to reduce register pressure7Tips For debuggingIf card is compute 2.0 use printf on devicecuPrintf might be useful for cards <2.0look in SDK for code and example“Invalidate” code by putting:If(threadIdx.x==-1){ …code here…}Prevents compiler from optimizing away code Move statement until problem found8Tips For debuggingChecking for execution errors:CUDA_SAFE_CALL(…);Will terminate code with reference to line of codeMeans that something before this call went wrongCUT_CHECK_ERROR(“ERROR MESSAGE”);Prints out user specified string if something went wrong. 9Compiler InfoCompiler is smart about optimizing codeTakes care of register reuseCombining math operationsFused multiply add (MAD)Delay global memory access until variable is actually usedRemove unused codeIf a variable is computed but never used it gets removed at compile time10Compiler InfoCompiler is not perfectReorganizing complex code manually can helpUse --ptxas-options=-v for extra infoShows info at compile time:Compiling entry function '_Z8kernel_exPi' for 'sm_13'Used 16 registers, 4 bytes lmem, 4+16 bytes smem, 4 bytes cmem[1]Useful when optimizing register usagedon’t need to run code to see changes11Cuda DisassemblerLook at what the compiler actually doesAssembly code is a bit tricky but can be followedcuobjdump.exe –dump-sass prog.exe >out.txtWrite assembly to out.txtUseful for making sure that memory reads and writes are optimized, fast math functions are used etc.12Example kernelLoad 4 integers in single 128 bit (16 byte) loadDo some math in a loopStore 4 integers in single 128 bit write__global__ void kernel (int4* A, int reps){uint index=blockIdx.x*blockDim.x+threadIdx.x;for(int i=0; i<reps; i++){int4 temp=A[index];temp.x=temp.y*temp.z*temp.w;A[index]=temp;}}13Example Assembly (1.0)Function : _Z8kernelP4int4i/*0000*/ ISET.S32.C0 o [0x7f], g [0x5], R124, LE;/*0008*/ RET C0.NE;/*0010*/ MOV.U16 R0H, g [0x1].U16;/*0018*/ I2I.U32.U16 R1, R0L;/*0020*/ IMAD.U16 R0, g [0x6].U16, R0H, R1;/*0028*/ SHL R0, R0, 0x4;/*0030*/ IADD R5, g [0x4], R0;/*0038*/ IADD32I R0, R5, 0xc;/*0040*/ GLD.U32 R4, global14 [R0];/*0048*/ MOV R6, R124;/*0050*/ GLD.S128 R0, global14 [R5]; /*0058*/ IMUL32.U16.U16 R3, R0L, R1H;/*005c*/ IMUL32.U16.U16 R7, R4L, R2H;/*0060*/ IMAD.U16 R3, R0H, R1L, R3;/*0068*/ IMAD.U16 R7, R4H, R2L, R7;/*0070*/ SHL R3, R3, 0x10;/*0078*/ SHL R7, R7, 0x10;/*0080*/ IMAD.U16 R0, R0L, R1L, R3;/*0088*/ IMAD.U16 R3, R4L, R2L, R7;/*0090*/ IMUL.U16.U16 R7, R0L, R3H;/*0098*/ IMAD.U16 R7, R0H, R3L, R7;/*00a0*/ SHL R7, R7, 0x10;/*00a8*/ IADD32I R6, R6, 0x1;/*00b0*/ IMAD.U16 R0, R0L, R3L, R7;/*00b8*/ MOV R3, R4;/*00c0*/ ISET.S32.C0 o [0x7f], g [0x5], R6, NE;/*00c8*/ GST.S128 global14 [R5], R0;/*00d0*/ BRA C0.NE, 0x50;/*00d8*/ NOP;14Example Assembly (1.3)Function : _Z8kernelP4int4i/*0000*/ ISET.S32.C0 o [0x7f], g [0x5], R124, LE;/*0008*/ RET C0.NE;/*0010*/ G2R.U16 R0H, g [0x1].U16;/*0018*/ I2I.U32.U16 R1, R0L;/*0020*/ IMAD.U16 R0, g [0x6].U16, R0H, R1;/*0028*/ SHL R0, R0, 0x4;/*0030*/ IADD R5, g [0x4], R0;/*0038*/ IADD32I R0, R5, 0xc;/*0040*/ GLD.U32 R4, global14 [R0];/*0048*/ MOV.SFU R6, R124;/*0050*/ GLD.S128 R0, global14 [R5];/*0058*/ IMUL32.U16.U16 R3, R0L, R1H;/*005c*/ IMUL32.U16.U16 R7, R4L, R2H;/*0060*/ IMAD.U16 R3, R0H, R1L, R3;/*0068*/ IMAD.U16 R7, R4H, R2L, R7;/*0070*/ SHL R3, R3, 0x10;/*0078*/ SHL R7, R7, 0x10;/*0080*/ IMAD.U16 R0, R0L, R1L, R3;/*0088*/ IMAD.U16 R3, R4L, R2L, R7;/*0090*/ IMUL.U16.U16 R7, R0L, R3H;/*0098*/ IMAD.U16 R7, R0H, R3L, R7;/*00a0*/ SHL R7, R7, 0x10;/*00a8*/ IADD32I R6, R6, 0x1;/*00b0*/ IMAD.U16 R0, R0L, R3L, R7;/*00b8*/ MOV R3, R4;/*00c0*/ ISET.S32.C0 o [0x7f], g [0x5], R6, NE;/*00c8*/ GST.S128 global14 [R5], R0;/*00d0*/ BRA C0.NE, 0x50;/*00d8*/ NOP;15Branching Example__global__ void kernel(int* data){if(threadIdx.x==0){data[threadIdx.x]=1;}else if(threadIdx.x==1){data[threadIdx.x]=2;}}16Branching AssemblyFunction : _Z8kernelPi/*0000*/ I2I.U32.U16.C0 R0, R0L;/*0008*/ BRA C0.NE, 0x38;/*0010*/ SHL R1, R0, 0x2;/*0018*/ MVI R0, 0x1;/*0020*/ IADD R1, g [0x4], R1;/*0028*/ GST.U32 global14 [R1], R0;/*0030*/ RET;/*0038*/ ISET.C0 o [0x7f], R0, c [0x1] [0x0], NE;/*0040*/ RET

View Full Document