An Introduction to CUDA and Manycore Graphics Processors
Universal Parallel Computing Research Center
University of California, Berkeley

Overview
- Terminology: Multicore, Manycore, SIMD
- The CUDA Programming model
- Mapping CUDA to Nvidia GPUs
- Experiences with CUDA

Multicore and Manycore
- Multicore: yoke of oxen
  ▪ Each core optimized for executing a single thread
- Manycore: flock of chickens
  ▪ Cores optimized for aggregate throughput, deemphasizing individual performance
[Figure: Multicore vs. Manycore die photos]

Multicore & Manycore, cont.

Specifications                 | Core i7 960                                | GTX285
Processing Elements            | 4 cores, 4 way SIMD @3.2 GHz               | 30 cores, 8 way SIMD @1.5 GHz
Resident Strands/Threads (max) | 4 cores, 2 threads, 4 way SIMD: 32 strands | 30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads
SP GFLOP/s                     | 102                                        | 1080
Memory Bandwidth               | 25.6 GB/s                                  | 159 GB/s
Register File                  | -                                          | 1.875 MB
Local Store                    | -                                          | 480 kB

[Die photos: Core i7 (45nm), GTX285 (55nm)]
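The peak rates in the table follow from cores × SIMD width × FLOPs per lane per cycle × clock. One accounting consistent with the table's numbers (my reconstruction, not stated on the slide) credits the Core i7 with a dual-issued SSE multiply and add (2 FLOPs per lane per cycle) and each GTX285 lane with a dual-issued multiply-add plus multiply (3 FLOPs per cycle):

    Core i7 960: 4 cores × 4 lanes × 2 FLOPs × 3.2 GHz ≈ 102 SP GFLOP/s
    GTX285:     30 cores × 8 lanes × 3 FLOPs × 1.5 GHz = 1080 SP GFLOP/s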
What is a core?
- Is a core an ALU?
  ▪ ATI: We have 800 streaming processors!!
  ▪ Actually, we have 5 way VLIW * 16 way SIMD * 10 "SIMD cores"
- Is a core a SIMD vector unit?
  ▪ Nvidia: We have 240 streaming processors!!
  ▪ Actually, we have 8 way SIMD * 30 "multiprocessors"
  ▪ To match ATI, they could count another factor of 2 for dual issue
- In this lecture, we're using "core" consistent with the CPU world
  ▪ Superscalar, VLIW, SIMD are part of a core's architecture, not the number of cores

SIMD
- Single Instruction Multiple Data architectures make use of data parallelism
- SIMD can be area and power efficient
  ▪ Amortize control overhead over SIMD width
- Parallelism exposed to programmer & compiler
[Figure: SISD computes a + b = c one element at a time; SIMD (width = 2) computes (a1, a2) + (b1, b2) = (c1, c2) with one instruction]

SIMD: Neglected Parallelism
- It is difficult for a compiler to exploit SIMD
  ▪ How do you deal with sparse data & branches?
  ▪ Many languages (like C) are difficult to vectorize
  ▪ Fortran is somewhat better
- Most common solution:
  ▪ Either forget about SIMD (pray the autovectorizer likes you)
  ▪ Or instantiate intrinsics (assembly language)
  ▪ Requires a new code version for every SIMD extension

A Brief History of x86 SIMD
[Figure: timeline of x86 SIMD extensions and their widths]

What to do with SIMD?
- Neglecting SIMD in the future will be more expensive
  ▪ AVX: 8 way SIMD, Larrabee: 16 way SIMD, Nvidia: 32 way SIMD, ATI: 64 way SIMD
- This problem composes with thread level parallelism
- We need a programming model which addresses both problems
[Figure: 4 way SIMD (SSE) vs. 16 way SIMD (LRB)]

The CUDA Programming Model
- CUDA is a recent programming model, designed for
  ▪ Manycore architectures
  ▪ Wide SIMD parallelism
  ▪ Scalability
- CUDA provides:
  ▪ A thread abstraction to deal with SIMD
  ▪ Synchronization & data sharing between small groups of threads
- CUDA programs are written in C + extensions
- OpenCL is inspired by CUDA, but HW & SW vendor neutral
  ▪ Programming model essentially identical

Hierarchy of Concurrent Threads
- Parallel kernels composed of many threads
  ▪ all threads execute the same sequential program
- Threads are grouped into thread blocks
  ▪ threads in the same block can cooperate
- Threads/blocks have unique IDs
[Figure: block b containing threads t0, t1, …, tN]

What is a CUDA Thread?
- Independent thread of execution
  ▪ has its own PC, variables (registers), processor state, etc.
  ▪ no implication about how threads are scheduled
- CUDA threads might be physical threads
  ▪ as on NVIDIA GPUs
- CUDA threads might be virtual threads
  ▪ might pick 1 block = 1 physical thread on a multicore CPU

What is a CUDA Thread Block?
- Thread block = virtualized multiprocessor
  ▪ freely choose processors to fit data
  ▪ freely customize for each kernel launch
- Thread block = a (data) parallel task
  ▪ all blocks in a kernel have the same entry point
  ▪ but may execute any code they want
- Thread blocks of a kernel must be independent tasks
  ▪ program valid for any interleaving of block executions

Synchronization
- Threads within a block may synchronize with barriers

    … Step 1 …
    __syncthreads();
    … Step 2 …

- Blocks coordinate via atomic memory operations
  ▪ e.g., increment a shared queue pointer with atomicInc()
- Implicit barrier between dependent kernels

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);

Blocks must be independent
- Any possible interleaving of blocks should be valid
  ▪ presumed to run to completion without pre-emption
  ▪ can run in any order
  ▪ can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  ▪ shared queue pointer: OK
  ▪ shared lock: BAD … can easily deadlock
- Independence requirement gives scalability

Scalability
- Manycore chips exist in a diverse set of configurations
  ▪ Number of cores
- CUDA allows one binary to target all these chips
- Thread blocks bring scalability!

Hello World: Vector Addition

    // Compute vector sum C = A + B
    // Each thread performs one pairwise addition
    __global__ void vecAdd(float* a, float* b, float* c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main() {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);
    }
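The slide's code omits allocation and data transfer. A minimal complete version might look like the following sketch; the host buffers h_a/h_b/h_c and the value of N are illustrative, and it assumes N is a multiple of the block size (256):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1048576  // assumed to be a multiple of 256

    // Each thread computes one element of c = a + b
    __global__ void vecAdd(float* a, float* b, float* c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main() {
        size_t bytes = N * sizeof(float);
        float* h_a = (float*)malloc(bytes);
        float* h_b = (float*)malloc(bytes);
        float* h_c = (float*)malloc(bytes);
        for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate device (global) memory and copy inputs host -> device
        float *d_a, *d_b, *d_c;
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);
        cudaMalloc((void**)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);

        // Copy the result back; this cudaMemcpy waits for the kernel to finish
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

The cudaMemcpy() calls reflect the separate host and device memories described in the memory-model slides that follow.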
Flavors of parallelism
- Thread parallelism
  ▪ each thread is an independent thread of execution
- Data parallelism
  ▪ across threads in a block
  ▪ across blocks in a kernel
- Task parallelism
  ▪ different blocks are independent
  ▪ independent kernels

Memory model
- Thread: per-thread local memory
- Block: per-block shared memory
[Figure: each thread owns private local memory; each block owns shared memory visible to all of its threads]

Memory model, cont.
- Sequential kernels (Kernel 0, Kernel 1, …) share per-device global memory
[Figure: two kernels launched in sequence, both accessing per-device global memory]

Memory model, cont.
- Host memory and each device's memory are separate; data moves between them with cudaMemcpy()
[Figure: Host Memory connected to Device 0 Memory and Device 1 Memory via cudaMemcpy()]

Using per-block shared memory
- Variables shared across a block

    __shared__ int …
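The slide's shared-memory example is truncated above; as an illustrative sketch of the pattern (mine, not necessarily the deck's), threads stage data in __shared__ memory, synchronize with __syncthreads(), and then read elements loaded by their neighbors:

    #define BLOCK_SIZE 256

    // Adjacent difference: result[i] = input[i] - input[i-1], for i >= 1.
    // s_data is shared across the block, so each element of input is read
    // from global memory once per block instead of twice.
    __global__ void adjDiff(int* result, int* input, int n) {
        __shared__ int s_data[BLOCK_SIZE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            s_data[threadIdx.x] = input[i];
        __syncthreads();  // barrier: all loads complete before any reads

        if (0 < i && i < n) {
            if (threadIdx.x > 0)
                result[i] = s_data[threadIdx.x] - s_data[threadIdx.x - 1];
            else  // first thread in a block: left neighbor lives in another block
                result[i] = s_data[0] - input[i - 1];
        }
    }

Launched, e.g., as adjDiff<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_result, d_input, n).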