UW-Madison ME 964 - Lecture 222

This preview shows page 1-2-15-16-31-32 out of 32 pages.


Slides in this deck:
  Slide 1
  Before We Get Started…
  Global Memory Access Compute Capability 1.3
  Examples [Preamble]
  Example: Scenario 1
  Example: Scenario 2
  Example: Scenario 3
  Example: Scenario 4
  Example: Scenario 4 [Cntd.]
  Looking Beyond Tesla C1060
  Looking Beyond Tesla C1060 [Cntd.]
  Think about this…
  Technical Specifications and Features [Short Detour]
  Slide 14
  Discuss Shared Memory Issues
  Shared Memory: Syntax & Semantics
  Shared Memory [Tesla C1060]
  Shared Memory: Transaction Rules
  Slide 19
  Matrix Multiplication: Shared Memory Usage
  Shared Memory Architecture
  Bank Addressing Examples
  Bank Addressing Examples
  Shared Memory Bank Conflicts
  How addresses map to banks on Tesla C1060
  Linear Addressing
  The Math Beyond Bank Conflicts
  Data types and bank conflicts
  Structs and Bank Conflicts
  Common Array Bank Conflict Patterns 1D
  A Better Array Access Pattern
  Slide 32

ME964
High Performance Computing for Engineering Applications

“Once a new technology rolls over you, if you're not part of the steamroller, you're part of the road.”
Stewart Brand

© Dan Negrut, 2011
ME964 UW-Madison

Memory Issues in CUDA
February 22, 2011

Before We Get Started…

Last time
  Wrapped up discussion about execution scheduling on the GPU
  Discussed global memory access issues in CUDA
Today
  Examples, global memory accesses
  Discuss shared memory accesses in CUDA
  A couple of comments on HW4
Other issues
  HW4 due tonight at 11:59 PM
    Use Learn@UW drop-box to submit homework
  HW5 posted, due on March 1, 11:59 PM
  Please take a look at the latest version of the syllabus; it has been updated recently
  Thursday, Feb. 24: TAs Toby Heyn and Arman Pazouki will provide an overview of two Midterm Project topics, Discrete Element Method (DEM) and Collision Detection, respectively
  Wednesday, Feb. 23: no office hours; I will be traveling (leaving Wednesday at noon, returning Thursday evening)

Global Memory Access Compute Capability 1.3

A global memory request for a warp is split into two memory requests, one for each half-warp.
The following 5-stage protocol is used to determine the memory transactions necessary to service all threads in a half-warp.
Stage 1: Find the memory segment that contains the address requested by the lowest-numbered active thread. The memory segment size depends on the size of the words accessed by the threads:
  32 bytes for 1-byte words,
  64 bytes for 2-byte words,
  128 bytes for 4-, 8- and 16-byte words.
Stage 2: Find all other active threads whose requested address lies in the same segment.
Stage 3: Reduce the transaction size, if possible:
  If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
  If the transaction size is 64 bytes (originally or after reduction from 128 bytes) and only the lower or upper half is used, reduce the transaction size to 32 bytes.
Stage 4: Carry out the transaction and mark the serviced threads as inactive.
Stage 5: Repeat until all threads in the half-warp are serviced.

Examples [Preamble]

Look at an example that deals with 32-bit words (4 bytes).
  This is the case when handling integers or floats.
Various scenarios are considered to illustrate how the two factors (layout of access and alignment) come into play when accessing global memory.
Note that when handling 32-bit words, the “segment size” is 128 bytes, with every segment aligned at a multiple of 128.
In what follows, a different color is associated with each 128-byte memory segment.
  In other words, two rows of the same color represent a 128-byte aligned segment.

Example: Scenario 1

Coalesced access in which all threads but one access the corresponding word in a segment.
This access pattern results in a single 64-byte transaction, indicated by the red rectangle.
Note that even though one word is not requested, all data in the segment are fetched.
If accesses by threads were permuted within this segment, still one 64-byte transaction would be performed on Tesla C1060.

Example: Scenario 2

Sequential threads in a half-warp access memory that is sequential but not aligned with the segments.
Given that the addresses fall within a 128-byte segment, a single 128-byte transaction is performed on Tesla C1060.

Example: Scenario 3

A half-warp accesses memory that is sequential but split across two 128-byte segments. Note that the request spans two different memory segments.
On Tesla C1060, two transactions are performed: one 64-byte transaction and one 32-byte transaction.

Example: Scenario 4

Strided access to global memory, as shown in the code snippet below:
[code snippet not captured in this text preview]
Although a stride of 2 above results in a single transaction, half the elements in the transaction are not used and represent wasted bandwidth.

Example: Scenario 4 [Cntd.]

Strided access to global memory, as shown in the code snippet below:
[code snippet not captured in this text preview]
As the stride increases, the effective bandwidth decreases until the point where 16 transactions are issued for the 16 threads in a half-warp, as shown in the plot.

[Figure labels: Compute Capability 1.3; Compute Capability 1.0]

Looking Beyond Tesla C1060

Tesla C1060 represents compute capability 1.3.
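The five-stage protocol for compute capability 1.3 lends itself to a small host-side model that can be checked against Scenarios 1 through 4. What follows is an illustrative Python simulation, not CUDA code and not part of the original slides. The function name `half_warp_transactions`, the use of `None` to mark an inactive thread, and the byte offsets picked for each scenario are assumptions made for this sketch.

```python
def half_warp_transactions(addresses, word_size=4):
    """Model the 5-stage protocol for one half-warp.

    addresses: byte address requested by each thread (index = thread id);
               None marks a thread that issues no request (sketch convention).
    Returns a list of (start_address, size_in_bytes) transactions.
    """
    # Segment size depends on the word size: 32 B, 64 B, or 128 B.
    segment_size = {1: 32, 2: 64}.get(word_size, 128)
    pending = list(addresses)
    transactions = []

    # Stage 5: repeat until every thread has been serviced.
    while any(a is not None for a in pending):
        # Stage 1: segment containing the address of the lowest-numbered active thread.
        first = next(a for a in pending if a is not None)
        base = first - first % segment_size
        size = segment_size

        # Stage 2: all active threads whose address lies in the same segment.
        served = [i for i, a in enumerate(pending)
                  if a is not None and base <= a < base + size]
        lo = min(pending[i] for i in served)
        hi = max(pending[i] for i in served) + word_size  # one past the last byte used

        # Stage 3: shrink 128 -> 64 -> 32 while only one half is used.
        while size > 32:
            half = size // 2
            if hi <= base + half:        # only the lower half is used
                size = half
            elif lo >= base + half:      # only the upper half is used
                base += half
                size = half
            else:
                break

        # Stage 4: carry out the transaction and mark the threads as serviced.
        transactions.append((base, size))
        for i in served:
            pending[i] = None

    return transactions


# Scenario 1: aligned, sequential, thread 3 does not participate.
scenario1 = [None if t == 3 else 4 * t for t in range(16)]
print(half_warp_transactions(scenario1))                          # [(0, 64)]

# Scenario 2: sequential but shifted by one word, still inside one segment.
print(half_warp_transactions([4 + 4 * t for t in range(16)]))     # [(0, 128)]

# Scenario 3: sequential, straddling the boundary at byte 128.
print(half_warp_transactions([84 + 4 * t for t in range(16)]))    # [(64, 64), (128, 32)]

# Scenario 4: stride of 2 words, then a stride of 32 words.
print(half_warp_transactions([8 * t for t in range(16)]))         # [(0, 128)]
print(len(half_warp_transactions([128 * t for t in range(16)])))  # 16
```

With a stride of 32 words every thread lands in its own 128-byte segment, so the model issues sixteen 32-byte transactions that each carry a single useful 4-byte word. That matches the bandwidth collapse described for large strides.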
How about other compute capabilities?
Look at the same example as before:
  Accessing floats or integers for global memory transactions
Example 1: access is aligned and sequential

Looking Beyond Tesla C1060 [Cntd.]

Example 2: Aligned but non-sequential
Example 3: Misaligned and sequential

Think about this…

Say your program uses complex data constructs that could be organized using C structures.
Based on what we learned today, which is the more advantageous way to store the data in global memory?
  Alternative A: as an array of structures
  Alternative B: as a structure of arrays

Technical Specifications and Features [Short Detour]

[Table column headings: Tesla C1060 (Newton); Fermi (Euler)]

Vector Reduction with Bank Conflicts
(assume 1024 vector entries stored in shared memory; one block of 512 threads carries out the reduction)
[Figure: array elements (floats), indices 0 through 11 shown]

[Short Detour] Discuss Shared Memory Issues

Shared Memory: Syntax & Semantics

You can statically declare shared memory as in the code snippet below:
[code snippet not captured in this text preview]
NOTE: this makes the variable aTile visible to all threads in each block, and only to those threads.
The thread that executes the kernel above sees the aTile declaration and understands that all its brother-threads in the block are going to see it too. Together they share this variable.
The same thread, when it sees the variable “row” it
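One way to reason about the array-of-structures versus structure-of-arrays question posed under “Think about this…” is to count how many 128-byte segments a half-warp touches when every thread reads the same field of its own element. The Python sketch below is a rough host-side model, not CUDA code and not taken from the slides. The helper names, the four-float structure, and the 1024-element array are hypothetical, and the segment count is only a proxy for the number of memory transactions on compute capability 1.3.

```python
SEGMENT_BYTES = 128  # segment size for 4-byte words
WORD_BYTES = 4       # one float
HALF_WARP = 16       # threads serviced together on compute capability 1.x


def segments_touched(addresses):
    """Number of distinct 128-byte segments covered by a set of byte addresses."""
    return len({a // SEGMENT_BYTES for a in addresses})


def aos_addresses(field_index, fields_per_struct):
    """Alternative A: element t is a struct; thread t reads one field of element t."""
    struct_bytes = fields_per_struct * WORD_BYTES
    return [t * struct_bytes + field_index * WORD_BYTES for t in range(HALF_WARP)]


def soa_addresses(field_index, num_elements):
    """Alternative B: each field is its own contiguous array of num_elements floats."""
    array_base = field_index * num_elements * WORD_BYTES
    return [array_base + t * WORD_BYTES for t in range(HALF_WARP)]


# Hypothetical struct with 4 float fields, 1024 elements in total.
print(segments_touched(aos_addresses(field_index=0, fields_per_struct=4)))  # 2
print(segments_touched(soa_addresses(field_index=0, num_elements=1024)))    # 1
```

Under these assumptions the array-of-structures layout behaves like a strided access with a stride of four words, so the 16 useful floats are spread over two segments. The structure-of-arrays layout keeps them contiguous in one. Wider structures spread the half-warp over more segments still, which suggests Alternative B is the friendlier layout for coalescing when threads process one field at a time.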

