ME964
High Performance Computing for Engineering Applications

Memory Issues in CUDA
February 22, 2011

"Once a new technology rolls over you, if you're not part of the steamroller, you're part of the road." -- Stewart Brand

(c) Dan Negrut, 2011, ME964 UW-Madison

Before We Get Started...

Last time:
- Wrapped up the discussion of execution scheduling on the GPU
- Discussed global memory access issues in CUDA

Today:
- Examples of global memory accesses
- Discussion of shared memory accesses in CUDA
- A couple of comments on HW4

Other issues:
- HW4 is due tonight at 11:59 PM; use the Learn@UW drop-box to submit homework
- HW5 has been posted; it is due on March 1 at 11:59 PM
- Please take a look at the latest version of the syllabus; it has been updated recently
- Thursday, Feb. 24: TAs Toby Heyn and Arman Pazouki will provide an overview of two Midterm Project topics, Discrete Element Method (DEM) and Collision Detection, respectively
- Wednesday, Feb. 23: no office hours (I will be traveling, leaving Wednesday at noon and returning Thursday evening)

Global Memory Access (Compute Capability 1.3)

A global memory request for a warp is split into two memory requests, one for each half-warp. The following five-stage protocol determines the memory transactions necessary to service all threads in a half-warp:

- Stage 1: Find the memory segment that contains the address requested by the lowest-numbered active thread. The segment size depends on the size of the words accessed by the threads: 32 bytes for 1-byte words, 64 bytes for 2-byte words, 128 bytes for 4-, 8-, and 16-byte words.
- Stage 2: Find all other active threads whose requested address lies in the same segment.
- Stage 3: Reduce the transaction size, if possible:
  - If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes.
  - If the transaction size is 64 bytes (originally, or after reduction from 128 bytes) and only the lower or upper half is used, reduce the transaction size to 32 bytes.
- Stage 4: Carry out the transaction and mark the serviced threads as inactive.
- Stage 5: Repeat until all threads in the half-warp are serviced.

Examples [Preamble]

- Look at examples that deal with 32-bit words (4 bytes); this is the case when handling integers or floats
- Various scenarios are considered to illustrate how the two factors (layout of access and alignment) come into play when accessing global memory
- Note that when handling 32-bit words, "segment size" represents 128-byte data chunks, all aligned at multiples of 128
- In the figures that follow, a different color is associated with each 128-byte memory segment; in other words, two rows of the same color represent a 128-byte aligned segment

Example: Scenario 1

Coalesced access in which all threads but one access the
corresponding word in a segment.

- This access pattern results in a single 64-byte transaction (the red rectangle in the slide's figure)
- Note that even though one word is not requested, all data in the segment are fetched
- If the accesses by the threads were permuted within this segment, still a single 64-byte transaction would be performed on Tesla C1060

Example: Scenario 2

- Sequential threads in a half-warp access memory that is sequential but not aligned with the segments
- Given that the addresses still fall within a single 128-byte segment, a single 128-byte transaction is performed on Tesla C1060

Example: Scenario 3

- A half-warp accesses memory that is sequential but split across two 128-byte segments; the request spans two different memory segments
- On Tesla C1060, two transactions are performed: one 64-byte transaction and one 32-byte transaction

Example: Scenario 4

- Strided access to global memory: thread t of the half-warp accesses the 4-byte word at index t * stride
- Although a stride of 2 still results in a single transaction, note that half the elements in the transaction are not used and represent wasted bandwidth

Example: Scenario 4 [Cntd.]

- As the stride increases, the effective bandwidth decreases until the point where 16 transactions are issued for the 16 threads in a half-warp
- (Plot: effective bandwidth vs. stride, one curve for compute capability 1.3 and one for compute capability 1.0)

Looking Beyond Tesla C1060

Tesla C1060 represents compute capability 1.3.
How about other compute capabilities?

- Look at the same example as before: accessing floats or integers in global memory transactions
- Example 1: the access is aligned and sequential

Looking Beyond Tesla C1060 [Cntd.]

- Example 2: aligned but non-sequential
- Example 3: misaligned and sequential

Think about this...

- Say your program uses complex data constructs that could be organized using C structures
- Based on what we learned today, which way is it more advantageous to store data in global memory?
  - Alternative A: as an array of structures
  - Alternative B: as a structure of arrays

Technical Specifications and Features [Short Detour]

(Table: technical specifications of Tesla C1060 ("Newton") vs. Fermi ("Euler"))

Vector Reduction with Bank Conflicts

- Assume 1024 vector entries (floats) stored in shared memory; one block of 512 threads carries out the reduction
- (Figure: reduction tree over the array elements)

[Short Detour]
Discuss Shared Memory Issues

Shared Memory: Syntax & Semantics

- You can statically declare shared memory by qualifying an in-kernel array declaration, such as the aTile array in the slide's code snippet, with __shared__
- NOTE: this makes the variable aTile visible to all threads in each block, and only to those threads
- The thread that executes the kernel sees the aTile declaration and understands that all its sibling threads in the block are going to see it too; together they share this variable
- The same thread, when it sees the automatic variable row, understands that each thread gets its own private copy of it
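The declaration described above can be sketched as follows. The names aTile and row come from the slide text; the kernel name, TILE_DIM, and the surrounding matrix-multiply logic are my assumptions about the elided snippet, in the style of the tiled examples used elsewhere in this course.

```cuda
// Illustrative sketch (identifiers other than aTile and row are hypothetical):
// each block stages a TILE_DIM x TILE_DIM tile of matrix A in shared memory.
#define TILE_DIM 16

__global__ void tiledMultiply(float *a, float *b, float *c, int N)
{
    // Visible to all threads of this block, and only to them
    __shared__ float aTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;  // private per thread
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element of the tile
    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    __syncthreads();  // the whole tile must be in place before anyone reads it

    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++)
        sum += aTile[threadIdx.y][i] * b[i * N + col];
    c[row * N + col] = sum;
}
```

Note the contrast the slide draws: aTile is one variable shared by the whole block, while row and col live in each thread's private registers.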
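The vector-reduction slide above (1024 floats, one block of 512 threads) can be sketched as the classic interleaved-addressing kernel. This is a reconstruction in the style of Kirk and Hwu, not the slide's verbatim code; the kernel and variable names are my own.

```cuda
// Sketch of the reduction the slide describes: 1024 floats staged in
// shared memory, reduced by one block of 512 threads.
__global__ void reduceWithConflicts(float *g_in, float *g_out)
{
    __shared__ float partial[1024];
    unsigned int t = threadIdx.x;          // 0..511

    // Each of the 512 threads loads two of the 1024 elements
    partial[2 * t]     = g_in[2 * t];
    partial[2 * t + 1] = g_in[2 * t + 1];

    // Interleaved addressing: the active threads spread farther apart at
    // each step, an access pattern that causes shared-memory bank conflicts
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % stride == 0)
            partial[2 * t] += partial[2 * t + stride];
    }

    if (t == 0)
        g_out[0] = partial[0];             // thread 0 writes the final sum
}
```

The bank-conflict behavior of this indexing pattern is what the later bank-conflict slides of the deck analyze.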