An Introduction to CUDA and Manycore Graphics Processors
Universal Parallel Computing Research Center
University of California, Berkeley

Overview
- Terminology: Multicore, Manycore, SIMD
- The CUDA Programming model
- Mapping CUDA to Nvidia GPUs
- Experiences with CUDA

Multicore and Manycore
- Multicore: yoke of oxen
  ▪ Each core optimized for executing a single thread
- Manycore: flock of chickens
  ▪ Cores optimized for aggregate throughput, deemphasizing individual performance
[Figure: Multicore vs. Manycore die photos]

Multicore & Manycore, cont.

Specifications                 | Core i7 960                                | GTX285
Processing Elements            | 4 cores, 4 way SIMD @3.2 GHz               | 30 cores, 8 way SIMD @1.5 GHz
Resident Strands/Threads (max) | 4 cores, 2 threads, 4 way SIMD: 32 strands | 30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads
SP GFLOP/s                     | 102                                        | 1080
Memory Bandwidth               | 25.6 GB/s                                  | 159 GB/s
Register File                  | -                                          | 1.875 MB
Local Store                    | -                                          | 480 kB

[Die photos: Core i7 (45nm), GTX285 (55nm)]
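The peak rates in the table follow from cores × SIMD width × FLOPs per lane per cycle × clock. One accounting consistent with the table's numbers (my reconstruction, not stated on the slide) credits the Core i7 with a dual-issued SSE multiply and add (2 FLOPs per lane per cycle) and each GTX285 lane with a dual-issued multiply-add plus multiply (3 FLOPs per cycle):

    Core i7 960: 4 cores × 4 lanes × 2 FLOPs × 3.2 GHz ≈ 102 SP GFLOP/s
    GTX285:     30 cores × 8 lanes × 3 FLOPs × 1.5 GHz = 1080 SP GFLOP/s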
What is a core?
- Is a core an ALU?
  ▪ ATI: We have 800 streaming processors!!
  ▪ Actually, we have 5 way VLIW * 16 way SIMD * 10 "SIMD cores"
- Is a core a SIMD vector unit?
  ▪ Nvidia: We have 240 streaming processors!!
  ▪ Actually, we have 8 way SIMD * 30 "multiprocessors"
  ▪ To match ATI, they could count another factor of 2 for dual issue
- In this lecture, we're using "core" consistent with the CPU world
  ▪ Superscalar, VLIW, SIMD are part of a core's architecture, not the number of cores

SIMD
- Single Instruction Multiple Data architectures make use of data parallelism
- SIMD can be area and power efficient
  ▪ Amortize control overhead over SIMD width
- Parallelism exposed to programmer & compiler
[Figure: SISD computes a + b = c one element at a time; SIMD (width = 2) computes (a1, a2) + (b1, b2) = (c1, c2) with one instruction]

SIMD: Neglected Parallelism
- It is difficult for a compiler to exploit SIMD
  ▪ How do you deal with sparse data & branches?
  ▪ Many languages (like C) are difficult to vectorize
  ▪ Fortran is somewhat better
- Most common solution:
  ▪ Either forget about SIMD (pray the autovectorizer likes you)
  ▪ Or instantiate intrinsics (assembly language)
  ▪ Requires a new code version for every SIMD extension

A Brief History of x86 SIMD
[Figure: timeline of x86 SIMD extensions and their widths]

What to do with SIMD?
- Neglecting SIMD in the future will be more expensive
  ▪ AVX: 8 way SIMD, Larrabee: 16 way SIMD, Nvidia: 32 way SIMD, ATI: 64 way SIMD
- This problem composes with thread level parallelism
- We need a programming model which addresses both problems
[Figure: 4 way SIMD (SSE) vs. 16 way SIMD (LRB)]

The CUDA Programming Model
- CUDA is a recent programming model, designed for
  ▪ Manycore architectures
  ▪ Wide SIMD parallelism
  ▪ Scalability
- CUDA provides:
  ▪ A thread abstraction to deal with SIMD
  ▪ Synchronization & data sharing between small groups of threads
- CUDA programs are written in C + extensions
- OpenCL is inspired by CUDA, but HW & SW vendor neutral
  ▪ Programming model essentially identical

Hierarchy of Concurrent Threads
- Parallel kernels composed of many threads
  ▪ all threads execute the same sequential program
- Threads are grouped into thread blocks
  ▪ threads in the same block can cooperate
- Threads/blocks have unique IDs
[Figure: block b containing threads t0, t1, …, tN]

What is a CUDA Thread?
- Independent thread of execution
  ▪ has its own PC, variables (registers), processor state, etc.
  ▪ no implication about how threads are scheduled
- CUDA threads might be physical threads
  ▪ as on NVIDIA GPUs
- CUDA threads might be virtual threads
  ▪ might pick 1 block = 1 physical thread on a multicore CPU

What is a CUDA Thread Block?
- Thread block = virtualized multiprocessor
  ▪ freely choose processors to fit data
  ▪ freely customize for each kernel launch
- Thread block = a (data) parallel task
  ▪ all blocks in a kernel have the same entry point
  ▪ but may execute any code they want
- Thread blocks of a kernel must be independent tasks
  ▪ program valid for any interleaving of block executions

Synchronization
- Threads within a block may synchronize with barriers

    … Step 1 …
    __syncthreads();
    … Step 2 …

- Blocks coordinate via atomic memory operations
  ▪ e.g., increment a shared queue pointer with atomicInc()
- Implicit barrier between dependent kernels

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);

Blocks must be independent
- Any possible interleaving of blocks should be valid
  ▪ presumed to run to completion without pre-emption
  ▪ can run in any order
  ▪ can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  ▪ shared queue pointer: OK
  ▪ shared lock: BAD … can easily deadlock
- Independence requirement gives scalability

Scalability
- Manycore chips exist in a diverse set of configurations
  ▪ Number of cores
- CUDA allows one binary to target all these chips
- Thread blocks bring scalability!

Hello World: Vector Addition

    // Compute vector sum C = A + B
    // Each thread performs one pairwise addition
    __global__ void vecAdd(float* a, float* b, float* c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main() {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);
    }
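The slide's code omits allocation and data transfer. A minimal complete version might look like the following sketch; the host buffers h_a/h_b/h_c and the value of N are illustrative, and it assumes N is a multiple of the block size (256):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1048576  // assumed to be a multiple of 256

    // Each thread computes one element of c = a + b
    __global__ void vecAdd(float* a, float* b, float* c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main() {
        size_t bytes = N * sizeof(float);
        float* h_a = (float*)malloc(bytes);
        float* h_b = (float*)malloc(bytes);
        float* h_c = (float*)malloc(bytes);
        for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate device (global) memory and copy inputs host -> device
        float *d_a, *d_b, *d_c;
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);
        cudaMalloc((void**)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_a, d_b, d_c);

        // Copy the result back; this cudaMemcpy waits for the kernel to finish
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

The cudaMemcpy() calls reflect the separate host and device memories described in the memory-model slides that follow.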
Flavors of parallelism
- Thread parallelism
  ▪ each thread is an independent thread of execution
- Data parallelism
  ▪ across threads in a block
  ▪ across blocks in a kernel
- Task parallelism
  ▪ different blocks are independent
  ▪ independent kernels

Memory model
- Thread: per-thread local memory
- Block: per-block shared memory
[Figure: each thread owns private local memory; each block owns shared memory visible to all of its threads]

Memory model, cont.
- Sequential kernels (Kernel 0, Kernel 1, …) share per-device global memory
[Figure: two kernels launched in sequence, both accessing per-device global memory]

Memory model, cont.
- Host memory and each device's memory are separate; data moves between them with cudaMemcpy()
[Figure: Host Memory connected to Device 0 Memory and Device 1 Memory via cudaMemcpy()]

Using per-block shared memory
- Variables shared across a block

    __shared__ int …
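The slide's shared-memory example is truncated above; as an illustrative sketch of the pattern (mine, not necessarily the deck's), threads stage data in __shared__ memory, synchronize with __syncthreads(), and then read elements loaded by their neighbors:

    #define BLOCK_SIZE 256

    // Adjacent difference: result[i] = input[i] - input[i-1], for i >= 1.
    // s_data is shared across the block, so each element of input is read
    // from global memory once per block instead of twice.
    __global__ void adjDiff(int* result, int* input, int n) {
        __shared__ int s_data[BLOCK_SIZE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            s_data[threadIdx.x] = input[i];
        __syncthreads();  // barrier: all loads complete before any reads

        if (0 < i && i < n) {
            if (threadIdx.x > 0)
                result[i] = s_data[threadIdx.x] - s_data[threadIdx.x - 1];
            else  // first thread in a block: left neighbor lives in another block
                result[i] = s_data[0] - input[i - 1];
        }
    }

Launched, e.g., as adjDiff<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_result, d_input, n).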