UW-Madison ME 964 - High Performance Computing for Engineering Applications - D237719


School name University of Wisconsin, Madison

Course Me 964- High Performance Computing for Engineering Applications



ME964 High Performance Computing for Engineering Applications

Slides:
- Before we get started…
- The CUDA Access Situation
- HW2: A word on getting started with CUDA
- Execution Configuration: Grids and Blocks
- Block and Thread IDs
- CUDA Device Memory Space Overview
- Global, Constant, and Texture Memories (Long Latency Accesses by Host)
- End: Memory Layout on the GPU. Begin: CUDA API
- What is an API?
- Overview
- Talking about the API: The CUDA Software Stack
- CUDA Highlights: Easy and Lightweight
- CUDA Device Memory Allocation
- A Small Detour: A Matrix Data Type
- CUDA Device Memory Allocation (cont.)
- CUDA Host-Device Data Transfer
- CUDA Host-Device Data Transfer (cont.)
- CUDA Function Declarations
- CUDA Function Declarations (cont.)
- Calling a Kernel Function, and the Concept of Execution Configuration

ME964 High Performance Computing for Engineering Applications
CUDA Memory Model & CUDA API
Sept. 16, 2008

Before we get started…

Last time:
- Traced back the evolution of the GPU
- GPGPU and the CUDA step forward
- CUDA-related nomenclature
- Memory layout of a typical NVIDIA GPU

Today:
- The CUDA API
- Start discussing the CUDA programming model
- A look at a matrix multiplication example

The CUDA Access Situation

- You can install CUDA on your computer even if you don’t have a GPU card
  - You can do 95% of your HW2 without needing a GPU
- Cards to be installed this afternoon in 1235ME
- CAE doesn’t want to make Visual Studio 2005 available (they use Visual Studio 2008)
- I’m looking into opening up my lab in case 1235ME doesn’t prove to be an option
- Linux accounts available at UIUC on a GPU-based supercomputer
  - See the Forum posting for details

HW2: A word on getting started with CUDA

- After you unzip the emailed assignment file, you should get a collection of files [screenshot of the file listing not reproduced in this preview]
- Double click hello world.sln to get started
- The directory Linux contains a makefile and the required files to get you going with this OS
- NOTE: readme.doc contains the text of the assignment

Execution Configuration: Grids and Blocks

- A kernel is executed as a grid of blocks of threads
  - All threads share the global memory space
- A block [of threads] is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Threads from two different blocks cannot cooperate!
  - This has important software design implications

[Figure: the host launches Kernel 1 and Kernel 2 on the device. Grid 1 is a 3 x 2 arrangement of blocks, Block(0, 0) through Block(2, 1). Block (1, 1) of Grid 2 is expanded into a 5 x 3 arrangement of threads, Thread(0, 0) through Thread(4, 2). Courtesy: NVIDIA]
HK-UIUC

Block and Thread IDs

- Threads and blocks have IDs
  - So each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Why this 2D and 3D layout?
  - It simplifies memory addressing when processing multidimensional data
    - Image processing
    - Solving PDEs on subdomains
    - …

[Figure: Grid 1 on the device as a 3 x 2 arrangement of blocks, with Block (1, 1) expanded into a 5 x 3 arrangement of threads, Thread(0, 0) through Thread(4, 2). Courtesy: NVIDIA]
HK-UIUC

CUDA Device Memory Space Overview

- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memory

IMPORTANT NOTE: Global, constant, and texture memory spaces are persistent across kernels called by the same host application.

[Figure: the (Device) Grid holds Constant Memory, Texture Memory, and Global Memory, all connected to the Host. Block (0, 0) and Block (1, 0) each have their own Shared Memory; within each block, Thread (0, 0) and Thread (1, 0) each have their own Registers and Local Memory.]
HK-UIUC

Global, Constant, and Texture Memories (Long Latency Accesses by Host)

- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by the host
  - Contents visible to all threads

NOTE: We will not emphasize texture memory in this class.

[Figure: same device memory layout as on the previous slide. Courtesy: NVIDIA]
HK-UIUC

End: Memory Layout on the GPU
Begin: CUDA API

What is an API?

- Application Programming Interface (API)
  - A set of functions, procedures or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
  - Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- Cooked-up analogy (for the mechanical engineer)
  - Think of a car: you can say it has a certain Device Operating Interface (DOI)
  - A series of pedals, gauges, handwheel, etc. This would be its DOI
- In this context, CUDA is the API that enables you to tap into the computational resources of the NVIDIA GPU
  - This is what replaced the old GPGPU way of programming the hardware

Overview

- CUDA programming model: basic concepts and data types
- CUDA application programming interface: basics
- Simple example to illustrate basic concepts and functionality
- Performance features will be covered later

HK-UIUC

Talking about the API: The CUDA Software Stack

- The image at right indicates where the API fits in the picture [image not reproduced in this preview]
- An API layer is indicated by a thick red line
- Dealing with the CUDA Driver API is tedious
- We’ll only discuss the CUDA Runtime API, which handles all the dirty laundry for you (under the hood, it might deal with the CUDA Driver)
- Examples of CUDA libraries: CUDA FFT and CUDA BLAS

CUDA Highlights: Easy and Lightweight

- The entire CUDA API is an extension to the ANSI C programming language
  - Low learning curve
- The hardware is designed to enable a lightweight runtime and driver
  - High performance

Here we go…

HK-UIUC

CUDA Device Memory Allocation

[Figure: the device memory layout shown on the earlier memory slides; the preview text is cut off partway through this figure.]
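
The execution configuration and the block and thread IDs described in the slides above can be made concrete with a small sketch. It is not taken from the lecture: the kernel name fillWithGlobalIndex and the helper failed are hypothetical, and the 3 x 2 grid of 5 x 3 blocks is chosen only to mirror the slide figure. Each thread combines its block ID and thread ID into a global (column, row) position and writes the matching row-major index into global memory, which the host then copies back and checks. Compiling it assumes the CUDA toolkit (nvcc) is installed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: every thread derives its global (col, row) position from its block
// ID and thread ID, then writes its row-major index into global memory.
__global__ void fillWithGlobalIndex(int* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {   // guard in case the grid overshoots the data
        out[row * width + col] = row * width + col;
    }
}

// Hypothetical helper (not part of CUDA): report a failed runtime call.
static bool failed(cudaError_t err, const char* what)
{
    if (err == cudaSuccess) return false;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return true;
}

int main()
{
    const dim3 block(5, 3);  // 5 x 3 threads per block, as in the slide figure
    const dim3 grid(3, 2);   // 3 x 2 blocks per grid, as in the slide figure
    const int width  = static_cast<int>(grid.x * block.x);  // 15 columns
    const int height = static_cast<int>(grid.y * block.y);  // 6 rows
    const int n = width * height;                           // 90 elements
    const size_t bytes = n * sizeof(int);

    int* hostData = static_cast<int*>(malloc(bytes));
    if (hostData == NULL) return 1;

    // Allocate per-grid global memory on the device.
    int* devData = NULL;
    if (failed(cudaMalloc((void**)&devData, bytes), "cudaMalloc")) {
        free(hostData);
        return 1;
    }

    // Launch the kernel with the execution configuration <<<grid, block>>>.
    fillWithGlobalIndex<<<grid, block>>>(devData, width, height);
    bool bad = failed(cudaGetLastError(), "kernel launch");

    // Copy the results back to the host; this call waits for the kernel first.
    bad = bad || failed(cudaMemcpy(hostData, devData, bytes,
                                   cudaMemcpyDeviceToHost), "cudaMemcpy");

    // Every element should hold its own index if the ID arithmetic is right.
    int mismatches = 0;
    for (int i = 0; i < n && !bad; ++i) {
        if (hostData[i] != i) ++mismatches;
    }
    if (!bad) printf("elements: %d, mismatches: %d\n", n, mismatches);

    cudaFree(devData);
    free(hostData);
    return (bad || mismatches != 0) ? 1 : 0;
}
```

The bounds check in the kernel is redundant for this exact configuration, since the grid covers the data exactly. It is kept because a grid usually has to be rounded up when the data size is not a multiple of the block size. Note that no thread here depends on data written by another block, consistent with the rule that threads from different blocks cannot cooperate.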
