An Introduction to CUDA/OpenCL and Manycore Graphics Processors
Bryan Catanzaro, NVIDIA Research

2/76: Overview
- Terminology: multicore, manycore, SIMD
- The CUDA and OpenCL programming models
- Understanding how CUDA maps to NVIDIA GPUs
- Thrust

3/76: Heterogeneous Parallel Computing
- Manycore GPU: scalable parallel processing
- Multicore CPU: fast serial processing

4/76: Multicore and Manycore
- Multicore: yoke of oxen
  - Each core optimized for executing a single thread
- Manycore: flock of chickens
  - Cores optimized for aggregate throughput, deemphasizing individual performance
- (apologies to Seymour Cray)

5/76: Multicore & Manycore, cont.

Specifications                  | Westmere-EP (32 nm)                          | Fermi, Tesla C2050 (40 nm)
Processing elements             | 6 cores, 2-issue, 4-way SIMD @ 3.46 GHz      | 14 SMs, 2-issue, 16-way SIMD @ 1.15 GHz
Resident strands/threads (max)  | 6 cores, 2 threads, 4-way SIMD: 48 strands   | 14 SMs, 48 SIMD vectors, 32-way SIMD: 21504 threads
SP GFLOP/s                      | 166                                          | 1030
Memory bandwidth                | 32 GB/s                                      | 144 GB/s
Register file                   | 6 kB (?)                                     | 1.75 MB
Local store / L1 cache          | 192 kB                                       | 896 kB
L2 cache                        | 1536 kB                                      | 0.75 MB
L3 cache                        | 12 MB                                        | none

6/76: Why Heterogeneity?
- Different goals produce different designs
  - Manycore assumes the workload is highly parallel
  - Multicore must be good at everything, parallel or not
- Multicore: minimize latency experienced by one thread
  - lots of big on-chip caches
  - extremely sophisticated control
- Manycore: maximize throughput of all threads
  - lots of big ALUs
  - multithreading can hide latency … so skip the big caches
  - simpler control, cost amortized over ALUs via SIMD

7/76: SIMD
- Single Instruction Multiple Data architectures make use of data parallelism
- We care about SIMD because of area and power efficiency concerns
  - Amortize control overhead over SIMD width
- Parallelism exposed to programmer and compiler
[Figure: SISD adds one pair, a + b = c; SIMD with width = 2 adds (a1, a2) + (b1, b2) = (c1, c2) with a single instruction]

8/76: SIMD: Neglected Parallelism
- OpenMP / Pthreads / MPI all neglect SIMD parallelism
- Because it is difficult for a compiler to exploit SIMD
- How do you deal with sparse data and branches?
  - Many languages (like C) are difficult to vectorize
- Most common solution:
  - Either forget about SIMD
    - Pray the autovectorizer likes you
  - Or instantiate intrinsics (assembly language)
  - Requires a new code version for every SIMD extension

9/76: A Brief History of x86 SIMD Extensions
[Diagram: timeline of x86 SIMD extensions]
- MMX (8 x 8-bit int)
- SSE (4 x 32-bit FP)
- SSE2 (2 x 64-bit FP)
- SSE3 (horizontal ops)
- SSSE3, SSE4.1, SSE4.2
- AVX (8 x 32-bit FP)
- AVX+FMA (3-operand)
- AVX2 (256-bit int ops, gather)
- LRB (512-bit)
- AMD side branches: 3dNow!, SSE4.A, SSE5

10/76: What to do with SIMD?
- Neglecting SIMD is becoming more expensive
  - AVX: 8-way SIMD, Larrabee: 16-way SIMD, NVIDIA: 32-way SIMD, ATI: 64-way SIMD
- This problem composes with thread-level parallelism
- We need a programming model which addresses both problems
[Figure: 4-way SIMD (SSE) vs. 16-way SIMD (LRB)]

11/76: The CUDA Programming Model
- CUDA is a programming model designed for:
  - Manycore architectures
  - Wide SIMD parallelism
  - Scalability
- CUDA provides:
  - A thread abstraction to deal with SIMD
  - Synchronization and data sharing between small groups of threads
- CUDA programs are written in C++ with minimal extensions
- OpenCL is inspired by CUDA, but HW & SW vendor neutral
  - Similar programming model, C only for device code

12/76: Hierarchy of Concurrent Threads
- Parallel kernels composed of many threads
  - all threads execute the same sequential program
- Threads are grouped into thread blocks
  - threads in the same block can cooperate
- Threads/blocks have unique IDs
[Figure: Thread t; Block b containing threads t0, t1, …, tN]
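The unique IDs are what let each thread pick out its own piece of the data. As a hedged illustration (not from the slides; the kernel name scale2D, the launcher, and the 16 x 16 block shape are arbitrary choices), a two-dimensional grid of two-dimensional blocks gives every thread a unique (x, y) coordinate:

    // Sketch only: each thread derives a unique (x, y) ID from its block
    // index, block size, and thread index, then guards against threads
    // that fall outside the image when the grid is rounded up.
    __global__ void scale2D(float* img, int width, int height, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] *= s;
    }

    void launch_scale2D(float* d_img, int width, int height)
    {
        dim3 block(16, 16);                        // 256 threads per block
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        scale2D<<<grid, block>>>(d_img, width, height, 2.0f);
    }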
13/76: What is a CUDA Thread?
- Independent thread of execution
  - has its own program counter, variables (registers), processor state, etc.
  - no implication about how threads are scheduled
- CUDA threads might be physical threads
  - as mapped onto NVIDIA GPUs
- CUDA threads might be virtual threads
  - might pick 1 block = 1 physical thread on a multicore CPU

14/76: What is a CUDA Thread Block?
- Thread block = a (data) parallel task
  - all blocks in a kernel have the same entry point
  - but may execute any code they want
- Thread blocks of a kernel must be independent tasks
  - program valid for any interleaving of block executions

15/76: CUDA Supports:
- Thread parallelism
  - each thread is an independent thread of execution
- Data parallelism
  - across threads in a block
  - across blocks in a kernel
- Task parallelism
  - different blocks are independent
  - independent kernels executing in separate streams

16/76: Synchronization
- Threads within a block may synchronize with barriers

    … Step 1 …
    __syncthreads();
    … Step 2 …

- Blocks coordinate via atomic memory operations
  - e.g., increment a shared queue pointer with atomicInc()
- Implicit barrier between dependent kernels

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);

17/76: Blocks must be independent
- Any possible interleaving of blocks should be valid
  - presumed to run to completion without pre-emption
  - can run in any order
  - can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - shared queue pointer: OK
  - shared lock: BAD … can easily deadlock
- Independence requirement gives scalability

18/76: Scalability
- Manycore chips exist in a diverse set of configurations
[Chart: number of SMs (0-35) for 8300GS, 9400M, 8800GTX, GTX285]
- CUDA allows one binary to target all these chips
- Thread blocks bring scalability!

19/76: Hello World: Vector Addition

    // Compute vector sum C = A + B
    // Each thread performs one pairwise addition
    __global__ void vecAdd(float* a, float* b, float* c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);
    }

20/76: Memory model
[Diagram, cut off in this excerpt: Thread with per-thread local memory; Block …]
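The slides above state that "threads in the same block can cooperate" and introduce the __syncthreads() barrier, but do not show the two together. A hedged sketch (not from the deck; the kernel name blockSum and the fixed 256-thread block size are illustrative assumptions) of cooperation through per-block shared memory:

    // Sketch only: a block-level sum reduction. Each block reduces 256 of
    // its elements in per-block __shared__ memory; __syncthreads() is the
    // barrier from the Synchronization slide.
    __global__ void blockSum(const float* in, float* blockSums, int n)
    {
        __shared__ float buf[256];               // per-block shared memory (assumes blockDim.x == 256)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                         // all loads visible before reducing

        // Tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();                     // barrier between reduction steps
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = buf[0];      // one partial sum per block
    }

Each block writes its partial sum independently, consistent with the "blocks must be independent" requirement; combining the per-block results would take a second kernel launch or atomics.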
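For completeness: the main() on the Hello World slide launches vecAdd on device pointers d_a, d_b, d_c without showing how they are set up. A minimal host-side sketch of the allocation and copies it assumes (names, the size N, and the omission of error checking are illustrative), using the standard CUDA runtime calls cudaMalloc, cudaMemcpy, and cudaFree:

    #include <cuda_runtime.h>

    // Kernel as on the Hello World slide.
    __global__ void vecAdd(float* a, float* b, float* c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main()
    {
        const int N = 1 << 20;                    // illustrative size, a multiple of 256
        size_t bytes = N * sizeof(float);

        float *h_a = new float[N], *h_b = new float[N], *h_c = new float[N];
        // ... fill h_a and h_b with input data ...

        float *d_a, *d_b, *d_c;                   // device (global memory) copies
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);
        cudaMalloc((void**)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<N / 256, 256>>>(d_a, d_b, d_c);  // launch as on the slide

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        delete[] h_a; delete[] h_b; delete[] h_c;
        return 0;
    }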