UT CS 395T - LECTURE NOTES

GPUs and GPU Programming
Bharadwaj Subramanian, Apollo Ellis, Keshav Pingali
Imagery taken from the Nvidia Dawn demo. Slides on GPUs, CUDA and programming models by Apollo Ellis; slides on OpenCL by Bharadwaj Subramanian.

A GPU is a Multi-core Architecture
• High throughput is prioritized over low-latency single-task execution.
• A large collection of fixed-function and software-programmable resources.

Graphics Pipeline
• A virtual scene and a virtual camera are used to render.
• Direct3D and OpenGL formulate the process as a pipeline of operations on fundamental entities:
– Vertices
– Primitives
– Fragments
– Pixels
• Data flows in entity streams between the pipeline stages.

Graphics Pipeline: GPU Front End
• Otherwise known as the Vertex Generator.
• Takes in vertex descriptors: location plus type (line, triangle, quad, poly), along with attributes (normal, texture coordinate, color, etc.).
• Performs a prefetch on the vertex data and constructs a vertex stream.

Graphics Pipeline: Vertex Processing
• Programmable vertex shaders execute here; they typically convert from world space to camera space. Shading languages include Cg and HLSL.
• Primitive Assembly: converts from vertices to primitives.
• Rasterization: samples primitives in screen space and generates fragments.

Graphics Pipeline: Fragment Processing
• Programmable fragment shaders execute here: texture lookup and light-interaction calculation, again in Cg or HLSL.
• ROP, or Raster Operations (depth-buffer cull, alpha blend): calculates each fragment's contribution to the pixels it covers.

Shader Programming
• Fragment and vertex processing are defined by shader programs written in Cg, GLSL, or HLSL.
• Shaders are compiled at runtime to binary, or compiled offline and then transformed at runtime.
• A shader is a C-like function that processes a single input and produces a single output in isolation.
• Shaders run in parallel on multiple shader cores, with wide SIMD instructions due to instruction streaming.

Parallel Processing and Encapsulation
• Task parallelism is available across stages: e.g., vertices are processed while fragments are processed.
• Data parallelism is available across stream entities. Each entity is independent of the others because the hard-to-parallelize work is offloaded onto the fixed-function units.
• Fixed-function units encapsulate hard-to-parallelize work in optimized hardware components.

Still a Scheduling Problem
• Processing and on-chip resources must be dynamically reallocated to pipeline stages, depending on the current load at each stage.
• Deciding whether different stages get more cores or more cache becomes an issue.
• Hardware multithreading provides a solution to thread stalls and distributes resources more evenly.

CUDA
• CUDA is a more general data-parallel model: there is no pipeline.
• Clusters of threads.
• Scatter operations (multiple writes) and gather operations (multiple reads).
• Application-based decomposition of work into threads.
• Threads can share data and communicate with each other.

CUDA Programming Model
• The GPU is viewed as a coprocessor with its own DRAM, running many threads in parallel.
• Data-parallel portions of applications can be offloaded onto this coprocessor.
• "C on the GPU":
– Global and shared variables
– Pointers and explicit memory allocation
– OpenGL and DirectX interoperability

Tesla Architecture
• A scalable array of multithreaded Streaming Multiprocessors (SMs), supporting 768 to 12,288 concurrent threads.

Kernels
• C/C++ simple functions or full programs.
• Organized into thread blocks and grids:
– Thread block: a set of concurrent threads that cooperate through barriers and shared memory.
– Grid: a set of thread blocks that are independent from each other.
• Multiple grids per kernel.

Syntax Example

__global__ void my_par_func(float a) {
    // do something with a
}

int dimGrid = 256, dimBlock = 256;
my_par_func<<<dimGrid, dimBlock>>>(5.0f);
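The example above shows only the launch syntax. As a minimal but complete sketch of the same pattern (the names vec_add, h_a, d_a, etc. are invented for illustration), the following program allocates device memory explicitly, computes a global thread index from the block and thread IDs, and launches a grid sized to cover the input:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical example kernel: each thread handles one element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host-side data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Explicit allocation on the coprocessor's DRAM, as in the
    // programming-model slide.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int dimBlock = 256;                            // threads per block
    int dimGrid = (n + dimBlock - 1) / dimBlock;   // blocks per grid (rounded up)
    vec_add<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);                 // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The bounds check inside the kernel matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads with no element to process.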
Execution
• A SIMT (Single Instruction, Multiple Thread) scheduler schedules warps, i.e. sets of concurrent threads, on the SM units.
• Each warp is scheduled independently of other warps.
• If a warp's threads diverge in control-flow path, the paths are executed one after the other, turning off the threads that are not affected.
• No recursion is allowed, because of stack-space problems.
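To make the divergence rule concrete, here is a small illustrative kernel (hypothetical names) in which adjacent threads of the same warp take different branches; the warp executes both branches serially, with the inactive threads masked off:

#include <cuda_runtime.h>

__global__ void diverge(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Neighboring threads in one warp take different paths, so the warp
    // runs BOTH paths one after the other, masking the inactive threads.
    if (i % 2 == 0)
        out[i] = 2.0f * i;   // even lanes active, odd lanes idle
    else
        out[i] = -1.0f;      // odd lanes active, even lanes idle
}

int main(void) {
    float *d_out;
    cudaMalloc(&d_out, 64 * sizeof(float));
    diverge<<<1, 64>>>(d_out);   // 64 threads = two warps of 32
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}

If the branch condition is uniform across a warp (for example, a condition on blockIdx alone), every lane takes the same path and there is no divergence penalty.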
SIMD vs SIMT
• CUDA utilizes the wide SIMD units, but SIMD is not exposed to the programmer.
• Instead, the SIMD units are shared by multiple threads at once: SIMT makes use of SIMD.

CUDA Wrap-Up
• A more general model running on the same hardware.
• The GPU is a CUDA coprocessor.
• Tesla architecture: 768 to 12,000+ threads.
• C/C++ syntax.
• Serial branching; no recursion.
• SIMD is used through SIMT.

Another Model: GRAMPS
• General Runtime Architecture for Multicore Parallel Systems.
• A programming model for graphics pipelines.
• Allows custom pipelines that mix fixed-function and programmable stages.
• Data is exchanged between stages using queues and buffers.
• The motivation comes from hybrid applications: REYES, rasterization, and ray tracing.

Execution Graphs
• The GRAMPS analog of a GPU pipeline, made up of stages.
• Provide scheduling information.
• Not limited to execution DAGs; cycles are not forbidden.
• Forward progress is therefore not guaranteed; the flexibility presumably outweighs the cost of losing the assurance that programs are well behaved.

Stages
• Types: SHADER, THREAD, and FIXED-FUNCTION.
• Stages operate asynchronously, which exposes parallelism.
• A stage indicates similarities in data access and execution characteristics that can be exploited for efficient processing.
• Useful when the benefits of coherent execution outweigh the cost of deferred processing.

Shader
• Short-lived, run-to-completion, per-element programs.
• A push operation is introduced for conditional output; otherwise queue inputs and outputs are managed automatically.
• Shader instances are scheduled in packets, similar to GPU execution.

Threads and Fixed Function
• Threads:
– Similar to CPU threads, and designed for task parallelism.
– Must be manually parallelized by the application.
– Useful for repacking data between shader stages and for processing bulk chunks of data where sharing or cross-communication is needed.
• Fixed function: wrappers around hardware units.

Buffers and Queues
• Buffers: essentially shared memory across stages.
• Queues:
– Packets are the primitive data format of a queue, defined at queue creation.
– Opaque packets are for data chunks which need not …
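GRAMPS's actual API is not given in these notes, so the following is only a hypothetical host-side sketch (plain C++, compilable as-is or inside a .cu file) of the execution-graph idea above: a THREAD-style producer stage feeding a SHADER-style per-element stage through a queue of packets. Every name in it is invented for illustration, and a real runtime would run the stages asynchronously and schedule them, rather than one after the other as here.

#include <cstdio>
#include <queue>
#include <vector>

// Invented types, only to illustrate GRAMPS-style stages and queues.
struct Packet { std::vector<float> data; };   // the queue's primitive data format

// THREAD-style stage: a manually written loop that packs bulk data.
void producer(std::queue<Packet> &q, int npackets) {
    for (int p = 0; p < npackets; ++p) {
        Packet pkt;
        for (int i = 0; i < 4; ++i) pkt.data.push_back(p + 0.25f * i);
        q.push(pkt);                          // enqueue for the next stage
    }
}

// SHADER-style stage: a short, run-to-completion, per-element function.
float shade(float x) { return 2.0f * x; }

int main() {
    std::queue<Packet> q;                     // queue connecting the two stages
    producer(q, 3);                           // run the producer stage
    while (!q.empty()) {                      // draining is the scheduler's job in GRAMPS
        Packet pkt = q.front(); q.pop();
        for (float &x : pkt.data) x = shade(x);   // apply the shader per element
        printf("packet out: %.2f ... %.2f\n", pkt.data.front(), pkt.data.back());
    }
    return 0;
}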

