GT CS 4803 - LECTURE NOTES
Georgia Tech

Spring 2011, Prof. Hyesoon Kim
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
Credit: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

OpenCL vs. CUDA Terminology
• Work-item (OpenCL) = thread (CUDA)
• Work-group (OpenCL) = block (CUDA)
• Wavefront (AMD) = warp (NVIDIA)

Memory per Compute Unit
• Local data store (LDS, on-chip)
• Registers
• L1 cache (8 KB for the 5870) per compute unit
• L2 cache shared between compute units (512 KB for the 5870)
• Fast path: 32-bit operations only (no atomics)
• Complete path: atomics and sub-32-bit operations
[Figure: compute unit (SIMD engine with LDS and registers) connected through the compute-unit-to-memory crossbar to the L1 cache, L2 cache, write cache, and atomic path]

SIMT Execution Model
[Figure: Add then Mul instructions of a wavefront issued across a 16-wide SIMD engine over successive cycles, overlapped by pipelining]
• SIMD execution can be combined with pipelining
• All ALUs execute the same instruction
• Pipelining is used to break each instruction into phases
• When the first instruction completes (4 cycles here), the next instruction is ready to execute

AMD Memory Model in OpenCL
• A subset of the hardware memory is exposed in OpenCL
• Local Data Share (LDS) is exposed as local memory
  – Shares data between work-items of a work-group, designed to increase performance
  – High-bandwidth access per SIMD engine
• Private memory uses registers, per work-item
• Constant memory: __constant-tagged data uses the L1 cache
[Figure: compute device memory hierarchy - global/constant memory with a data cache on the device, local memory per compute unit, private memory per work-item]
AMD Constant Memory Usage
• Constant memory declarations on AMD GPUs are beneficial only for the following access patterns:
  – Direct-addressing patterns: non-array constant values where the address is known up front
  – Same-index patterns: all work-items reference the same constant address
  – Globally scoped constant arrays: arrays that are initialized and globally scoped can use the cache if smaller than 16 KB
• Cases where each work-item accesses different indices are not cached and deliver the same performance as a global memory read

Coalescing Memory Accesses
• For the AMD 5870 GPU, memory accesses of 16 consecutive threads are evaluated together and can be coalesced to fully utilize the bus
  – This unit is called a quarter-wavefront
• Both NVIDIA and ATI use 16 consecutive threads as the minimum memory transaction
  – 16 × 4 B = 64 B; 16 × 8 B = 128 B
[Figure: global memory bandwidth (GB/s) vs. data size (1-256 MB) for a simple data-copy kernel with entirely coalesced vs. entirely uncoalesced accesses on an NVIDIA GTX 285]

Memory Banks
• Memory is made up of banks; memory banks are the hardware units that actually store data
• The memory banks targeted by an access depend on the address of the data to be read or written
• On current GPUs there are more memory banks than can be addressed at once by the global memory bus, so it is possible for different accesses to target different banks
  – Bank response time, not the number of access requests, is the bottleneck
• Successive data are stored in successive banks (strides of 32-bit words on GPUs), so a group of threads accessing successive elements will produce
no bank conflicts.

Bank Conflicts – Local Memory
• Bank conflicts have the largest negative effect on local memory operations
• Local memory does not require that accesses be to sequentially increasing elements
• Accesses from successive threads should target different memory banks
  – Threads accessing sequentially increasing data fall into this category
• On AMD, a wavefront that generates bank conflicts stalls until all local memory operations complete
  – The hardware does not hide the stall by switching to another wavefront
• The following examples show local memory access patterns and whether conflicts are generated; for readability, only 8 memory banks are shown
• If there are no bank conflicts, each bank can return an element without any delay
  – Both of the following patterns complete without stalls on current GPU hardware
[Figure: two conflict-free patterns - threads 0-7 each mapped to a distinct one of memory banks 0-7]
• If multiple accesses occur to the same bank, the bank with the most conflicts determines the latency
  – The following pattern takes 3 times the access latency to complete
[Figure: threads 0-7 mapped unevenly to the banks, with per-bank request counts 2, 1, 3, 1, 1; the worst bank receives 3 requests]
• If all accesses are to the same address, the bank can perform a broadcast and no delay is incurred
  – The following takes only one access to complete, assuming the same data
element is accessed.
[Figure: all 8 threads reading one address in a single bank (broadcast)]

Bank Conflicts – Global Memory
• Bank conflicts in global memory follow the same principles, but the global memory bus makes the impact of conflicts more subtle
• Since accessing data in global memory requires that an

