ME964 High Performance Computing for Engineering Applications
CUDA Memory Model & CUDA API
Sept. 16, 2008

Outline:
  Before we get started…
  The CUDA Access Situation
  HW2: A word on getting started with CUDA
  Execution Configuration: Grids and Blocks
  Block and Thread IDs
  CUDA Device Memory Space Overview
  Global, Constant, and Texture Memories (Long Latency Accesses by Host)
  End: Memory Layout on the GPU / Begin: CUDA API
  What is an API?
  Overview
  Talking about the API: The CUDA Software Stack
  CUDA Highlights: Easy and Lightweight
  CUDA Device Memory Allocation
  A Small Detour: A Matrix Data Type
  CUDA Device Memory Allocation (cont.)
  CUDA Host-Device Data Transfer
  CUDA Host-Device Data Transfer (cont.)
  CUDA Function Declarations
  CUDA Function Declarations (cont.)
  Calling a Kernel Function, and the Concept of Execution Configuration

Before we get started…
- Last time
  - Traced back the evolution of the GPU
  - GPGPU and the CUDA step forward
  - CUDA-related nomenclature
  - Memory layout of a typical NVIDIA GPU
- Today
  - The CUDA API
  - Start discussing the CUDA programming model
  - A look at a matrix multiplication example

The CUDA Access Situation
- You can install CUDA on your computer even if you don’t have a GPU card
  - You can do 95% of your HW2 without needing a GPU
- Cards to be installed this afternoon in 1235ME
- CAE doesn’t want to make Visual Studio 2005 available (they use Visual Studio 2008)
  - I’m looking into opening up my lab in case 1235ME doesn’t prove to be an option
- Linux accounts available at UIUC on a GPU-based supercomputer
  - See the Forum posting for details

HW2: A word on getting started with CUDA
- After you unzip the emailed assignment file, you should get a collection of files like below:
  [Figure: the unzipped assignment files]
- Double click hello world.sln to get started
- The directory Linux contains a makefile and required files to get you going with this OS
- NOTE: readme.doc contains the text of the assignment

Execution Configuration: Grids and Blocks
- A kernel is executed as a grid of blocks of threads
  - All threads share the global memory space
- A block [of threads] is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
    - For hazard-free shared memory accesses
  - Efficiently sharing data through a low latency shared memory
- Threads from two different blocks cannot cooperate!!!
  - This has important software design implications
[Figure: the host launches Kernel 1 and Kernel 2; each runs on the device as a grid (Grid 1, Grid 2). Grid 1 is a 3 x 2 array of blocks, Block (0, 0) through Block (2, 1). Block (1, 1) is expanded into a 5 x 3 array of threads, Thread (0, 0) through Thread (4, 2). Courtesy: NVIDIA]
HK-UIUC

Block and Thread IDs
- Threads and blocks have IDs
  - So each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Why this 2D and 3D layout?
  - Simplifies memory addressing when processing multidimensional data
    - Image processing
    - Solving PDEs on subdomains
    - …
[Figure: Grid 1 on the device as a 3 x 2 array of blocks, Block (0, 0) through Block (2, 1). Block (1, 1) is expanded into a 5 x 3 array of threads, Thread (0, 0) through Thread (4, 2). Courtesy: NVIDIA]

CUDA Device Memory Space Overview
- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memory
- IMPORTANT NOTE: Global, constant, and texture memory spaces are persistent across kernels called by the same host application.
[Figure: (Device) Grid with Constant Memory, Texture Memory, and Global Memory, all connected to the Host. Block (0, 0) and Block (1, 0) each have their own Shared Memory. Within each block, Thread (0, 0) and Thread (1, 0) each have their own Local Memory and Registers.]

Global, Constant, and Texture Memories (Long Latency Accesses by Host)
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by host
  - Contents visible to all threads
- NOTE: We will not emphasize texture memory in this class.
[Figure: (Device) Grid with Constant Memory, Texture Memory, and Global Memory, all connected to the Host. Block (0, 0) and Block (1, 0) each have their own Shared Memory. Within each block, Thread (0, 0) and Thread (1, 0) each have their own Local Memory and Registers. Courtesy: NVIDIA]

End: Memory Layout on the GPU
Begin: CUDA API

What is an API?
- Application Programming Interface (API)
  - A set of functions, procedures or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
  - Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- Cooked up analogy (for the mechanical engineer)
  - Think of a car: you can say it has a certain Device Operating Interface (DOI)
    - A series of pedals, gauges, handwheel, etc. This would be its DOI
- In this context, CUDA is the API that enables you to tap into the computational resources of the NVIDIA GPU
  - This is what replaced the old GPGPU way of programming the hardware

Overview
- CUDA programming model: basic concepts and data types
- CUDA application programming interface: basics
- Simple example to illustrate basic concepts and functionality
Performance features will be covered later

Talking about the API: The CUDA Software Stack
- Image at right indicates where the API fits in the picture
  [Figure: the CUDA software stack]
- An API layer is indicated by a thick red line
- Dealing with the CUDA Driver API is tedious
  - We’ll only discuss the CUDA Runtime API, which handles all the dirty laundry for you (under the hood, it might deal with the CUDA Driver)
- Examples of CUDA Libraries: CUDA FFT and CUDA BLAS

CUDA Highlights: Easy and Lightweight
- The entire CUDA API is an extension to the ANSI C programming language
  - Low learning curve
- The hardware is designed to enable lightweight runtime and driver
  - High performance
Here we go…

CUDA Device Memory Allocation
[Figure: (Device) Grid with Constant Memory, Texture Memory, and Global Memory; Block (0, 0) and Block (1, 0), each with Shared Memory and per-thread Local Memory and Registers …]
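
The "Block and Thread IDs" slide says each thread uses its IDs to decide what data to work on. The sketch below mimics that arithmetic in plain C on the host, so it runs without a GPU. It is an emulation under stated assumptions, not CUDA code from this lecture: the type Dim2 and the three function names are hypothetical, and the loops visit threads one after another where a GPU would run them in parallel. In an actual CUDA kernel the same quantities come from the built-in variables blockIdx, blockDim, and threadIdx.

```c
#include <assert.h>

/* Hypothetical 2D stand-in for the (x, y) part of CUDA's built-in ID types. */
typedef struct { int x, y; } Dim2;

/* Global (column, row) of a thread, computed from the ID of its block,
   the block dimensions, and the thread's ID inside that block. */
Dim2 global_thread_id(Dim2 block_idx, Dim2 block_dim, Dim2 thread_idx)
{
    Dim2 g;
    g.x = block_idx.x * block_dim.x + thread_idx.x;
    g.y = block_idx.y * block_dim.y + thread_idx.y;
    return g;
}

/* Row-major offset of element (g.x, g.y) in an array `width` elements wide. */
int linear_index(Dim2 g, int width)
{
    return g.y * width + g.x;
}

/* Walk over every thread of every block (sequentially, unlike a GPU) and
   count how many threads map to each array element.
   Returns 1 if every element is claimed by exactly one thread, else 0. */
int covers_each_element_once(Dim2 grid_dim, Dim2 block_dim)
{
    enum { MAX_ELEMS = 1024 };          /* arbitrary cap for this sketch */
    int hits[MAX_ELEMS] = {0};
    int width  = grid_dim.x * block_dim.x;
    int height = grid_dim.y * block_dim.y;
    assert(width * height <= MAX_ELEMS);

    for (int by = 0; by < grid_dim.y; ++by)
        for (int bx = 0; bx < grid_dim.x; ++bx)
            for (int ty = 0; ty < block_dim.y; ++ty)
                for (int tx = 0; tx < block_dim.x; ++tx) {
                    Dim2 b = { bx, by };
                    Dim2 t = { tx, ty };
                    Dim2 g = global_thread_id(b, block_dim, t);
                    hits[linear_index(g, width)]++;
                }

    for (int i = 0; i < width * height; ++i)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

With the layout drawn on the slide (a 3 x 2 grid of 5 x 3 blocks), Thread (4, 2) of Block (1, 1) maps to column 9, row 5, which is offset 84 in a 15-element-wide array. Because the mapping gives every thread a distinct element, threads in different blocks never need to cooperate to cover the data.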