Spring 2011  Prof. Hyesoon Kim
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Terminology
• OpenCL : CUDA
  – Work-item : thread
  – Work-group : block
• AMD : NVIDIA
  – Wavefront : warp

Memory per Compute Unit
• Memory per compute unit
  – Local Data Share (on-chip)
  – Registers
  – L1 cache (8 KB for the 5870) per compute unit
• L2 cache shared between compute units (512 KB for the 5870)
• Fast path for 32-bit operations only (no atomics)
• Complete path for atomics and sub-32-bit operations
[Figure: SIMD engine with LDS and registers, connected through the compute-unit-to-memory crossbar to the L1 cache, L2 cache, write cache, and atomic path]

SIMT Execution Model
• SIMD execution can be combined with pipelining
• All ALUs execute the same instruction
• Pipelining is used to break an instruction into phases
• When the first instruction completes (4 cycles here), the next instruction is ready to execute
[Figure: a wavefront executing Add then Mul on a SIMD unit, one SIMD-width slice of the wavefront issued per cycle, with each instruction pipelined over 4 cycles]

AMD Memory Model in OpenCL
• A subset of the hardware memory is exposed in OpenCL
• The Local Data Share (LDS) is exposed as local memory
  – Shares data between work-items of a work-group; designed to increase performance
  – High-bandwidth access per SIMD engine
• Private memory utilizes registers per work-item
• Constant memory: __constant tags utilize the L1 cache
[Figure: compute device with N compute units, each holding per-work-item private memory and a local memory, backed by a global/constant memory data cache and the compute device (global) memory]
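The pipelined SIMD timing above can be sketched with a small cycle-count model. This is an illustrative sketch, not AMD's documented timing: it assumes each instruction is issued over `wavefront_size / simd_width` cycles and that the pipeline drains `pipeline_depth - 1` cycles after the last issue; the function name and parameters are hypothetical.

```python
# Sketch of the slide's pipelined SIMD execution model (illustrative
# assumptions: issue one SIMD-width slice of the wavefront per cycle,
# and overlap successive instructions in the pipeline).

def simd_cycles(wavefront_size: int, simd_width: int,
                pipeline_depth: int, num_instructions: int) -> int:
    """Cycles for one wavefront to finish `num_instructions` instructions.

    Each instruction takes wavefront_size // simd_width issue cycles
    (4 cycles for a 64-item wavefront on a 16-wide SIMD, as in the
    slide). Because issue takes as many cycles as the pipeline has
    stages, the next instruction starts issuing as soon as the previous
    one has fully issued, keeping the ALUs busy; the final result
    drains pipeline_depth - 1 cycles after the last issue.
    """
    issue_cycles = wavefront_size // simd_width
    total_issue = issue_cycles * num_instructions
    drain = pipeline_depth - 1
    return total_issue + drain

# One Add followed by one Mul, as in the slide's diagram:
print(simd_cycles(64, 16, 4, 2))  # -> 11
```

The key point the model captures is that back-to-back instructions hide the pipeline latency: two instructions cost 8 issue cycles plus one drain, not two full round trips.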
AMD Constant Memory Usage
• Constant memory declarations for AMD GPUs are only beneficial for the following access patterns
  – Direct-addressing patterns: non-array constant values where the address is known up front
  – Same-index patterns: all work-items reference the same constant address
  – Globally scoped constant arrays: arrays that are initialized and globally scoped can use the cache if smaller than 16 KB
• Cases where each work-item accesses different indices are not cached and deliver the same performance as a global memory read

Coalescing Memory Accesses
• For the AMD 5870 GPU, memory accesses of 16 consecutive threads are evaluated together and can be coalesced to fully utilize the bus
  – This unit is called a quarter-wavefront
• Both NVIDIA and ATI use 16 consecutive threads as the minimum unit of memory traffic
• 16 * 4 B = 64 B; 16 * 8 B = 128 B
[Figure: global memory bandwidth (GB/s) vs. data size (MB) for a simple data-copying kernel on an NVIDIA GTX 285; entirely coalesced accesses sustain far higher bandwidth than entirely non-coalesced accesses]

Memory Banks
• Memory is made up of banks
  – Memory banks are the hardware units that actually store data
• The memory banks targeted by a memory access depend on the address of the data to be read/written
  – Note that on current GPUs there are more memory banks than can be addressed at once by the global memory bus, so it is possible for different accesses to target different banks
  – Bank response time, not the number of access requests, is the bottleneck
• Successive data elements are stored in successive banks (strides of 32-bit words on GPUs), so a group of threads accessing successive elements produces no bank conflicts
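The quarter-wavefront arithmetic above (16 threads × 4 B = one 64 B transaction) can be sketched as an address check. This is a simplified model, not the hardware's actual coalescing logic: it assumes an access coalesces exactly when all 16 addresses fall in a single aligned 64-byte segment, and the function name is hypothetical.

```python
# Sketch: does a quarter-wavefront's set of addresses coalesce?
# Assumption (illustrative): 16 consecutive threads each load one
# 4-byte element, and the access coalesces when all addresses fall
# in one aligned 64-byte segment (16 * 4 B = 64 B).

QUARTER_WAVEFRONT = 16
ELEMENT_BYTES = 4
SEGMENT_BYTES = QUARTER_WAVEFRONT * ELEMENT_BYTES  # 64 B

def coalesces(addresses):
    """True if every address hits the same aligned 64-byte segment."""
    segments = {addr // SEGMENT_BYTES for addr in addresses}
    return len(segments) == 1

# Unit-stride accesses from an aligned base: one 64 B transaction.
unit_stride = [i * ELEMENT_BYTES for i in range(QUARTER_WAVEFRONT)]
print(coalesces(unit_stride))   # -> True

# Stride-2 accesses span two segments: not coalesced.
stride_two = [2 * i * ELEMENT_BYTES for i in range(QUARTER_WAVEFRONT)]
print(coalesces(stride_two))    # -> False
```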
Bank Conflicts – Local Memory
• Bank conflicts have the largest negative effect on local memory operations
• Local memory does not require that accesses be to sequentially increasing elements, but accesses from successive threads should target different memory banks
  – Threads accessing sequentially increasing data fall into this category
• On AMD, a wavefront that generates bank conflicts stalls until all local memory operations complete
  – The hardware does not hide the stall by switching to another wavefront
• The following examples show local memory access patterns and whether they generate conflicts; for readability, only 8 memory banks are shown
• If there are no bank conflicts, each bank can return an element without any delay
  – [Figure: two conflict-free access patterns across 8 banks and 8 threads; both complete without stalls on current GPU hardware]
• If multiple accesses occur to the same bank, the bank with the most conflicts determines the latency
  – [Figure: a pattern in which one bank receives 3 accesses; it takes 3 times the access latency to complete]
• If all accesses are to the same address, the bank can perform a broadcast and no delay is incurred
  – [Figure: all 8 threads access the same data element in one bank; this takes only one access to complete]

Bank Conflicts – Global Memory
• Bank conflicts in global memory rely on the same principles; however, the global memory bus makes the impact of conflicts more subtle
• Since accessing data in global memory requires that an
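The local-memory examples above (unit stride, a 3-way conflict, and a broadcast) can be sketched as a small conflict counter. This is a simplified model of the rules the slides state, not the hardware scheduler: it assumes 8 banks, successive 32-bit words mapping to successive banks, and a free broadcast when requesters share an address; the function name is hypothetical.

```python
# Sketch: access-latency multiplier for a local memory access pattern.
# Assumptions (illustrative, matching the slides): 8 banks, word i
# lives in bank i % 8, and a bank broadcasts a single shared address
# to all requesters in one access.

NUM_BANKS = 8

def access_cycles(indices):
    """Latency multiplier for one set of per-thread word indices.

    Each distinct address within a bank needs its own access, so the
    bank serving the most distinct addresses determines the latency.
    Threads reading the same address from a bank share one broadcast.
    """
    per_bank = {}
    for idx in indices:
        per_bank.setdefault(idx % NUM_BANKS, set()).add(idx)
    return max(len(addrs) for addrs in per_bank.values())

print(access_cycles(range(8)))                    # unit stride -> 1
print(access_cycles([0, 8, 16, 3, 4, 5, 6, 7]))   # 3-way conflict -> 3
print(access_cycles([5] * 8))                     # broadcast -> 1
```

The second pattern shows the slides' rule that the worst bank sets the cost: bank 0 must serve three distinct words (0, 8, 16), so the whole access takes 3× the latency even though the other banks are idle after one cycle.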