GT CS 4803 - Lecture Notes - D2157948

Spring 2011
Prof. Hyesoon Kim
AMD presentations from Richard Huddy and Michael Doggett

Radeon                    2900              2600              2400
Stream Processors         320               120               40
SIMDs                     4                 3                 2
Pipelines                 16                8                 4
Texture Units             16                8                 4
Render Backends           16                4                 4
L2 texture cache (KB)     256               128               0
Technology (nm)           80                65                65
Area (mm2)                420               153               82
Transistors (millions)    720               390               180
Memory bandwidth          512               128               64
Optimized for             High clock speed  Power efficiency  Power efficiency

• 320 Stream processing units
• 4 SIMDs
• 4 Texture Units
• 4 Render Back-ends

http://www.chaaps.com/ati-vs-nvidia-in-2010-and-the-winner-is.html
• ATI: VLIW
  – A wider variety of caches (vertex caches…)
• NVIDIA: SIMD + SFU

• GPU interface with host
  – Processes the command stream from the graphics driver
• A custom RISC-based micro-coded engine
• First-class memory client with read/write access
• State management

• Vertex blocks: primitive tessellation, inputs (index & instancing)
  – Send vertex addresses to the shader core
• Geometry blocks: use on/off-chip staging
  – Send processed vertex addresses, near-neighbor addresses, and topological information
• Pixel blocks: scan conversion, triangle setup, rasterization, and interpolation
  – Interface to depth to perform HiZ/EarlyZ checks
Prepares data for processing by the stream processing units.
3 groups of blocks feed 3 data streams; each group feeds 16 elements.

• Main control for the shader core
• Separate command queues for each shader type
  – Each thread consists of a number of instructions that operate on a block of input data
  – All workloads have threads of 64 elements
  – 100's of threads in flight
  – Threads are put to sleep when they request a slow-responding resource

• Initial arbiter to select which thread to submit
• Two arbiter units per SIMD array
  – Allows each SIMD to be pipelined, with two operations in process at a time
• Dedicated arbiter units for texture and vertex fetches
  – Can be scheduled independently from math operations
• Executing threads can be bumped at any time if a higher-priority thread is pulled from the command queues
  – Temporary data is saved so the thread can resume later
• Arbitration policy
  – Age/need/availability
  – When in doubt, favor pixels
  – Programmable

• ATI has dedicated arbiter units for texture and vertex fetches. In which of the following cases do dedicated units provide a benefit? There are two design options: one unified arbiter with doubled entry size, or two dedicated arbiters (one per fetch type). Assume a 2 insts/cycle dispatch model; each queue has 2 entries.
• Case 1: vertex, vertex, texture, texture (vertex latency 1, texture latency 1)
• Case 2: vertex, vertex, vertex, vertex (vertex latency 1, texture latency 1)
• Case 3: texture, texture, texture, texture, vertex (vertex latency 2, texture latency 1)
• Case 4: vertex, texture, vertex, texture, vertex, texture (vertex latency 2, texture latency 1)
[Figure: the two designs. Dedicated: a Vertex Queue and a Texture Queue feeding the Vertex Units and Texture Units. Unified: one Unified Queue feeding the Vertex Units and Texture Units.]

• Dedicated shader caches
  – Instruction cache allows unlimited shader length
  – Constant cache allows an unlimited number of constants
  – Both caches take advantage of data re-use to reduce state-change overhead and improve efficiency
• Latency hiding
  – On a cache miss, switches to another thread
  – Suspended threads remain in the command queues until their requested data arrives
  – Ultra-threaded dispatch processor can queue up hundreds of threads

• 4 parallel SIMD units
• Each unit receives an independent ALU instruction
• Very Long Instruction Word (VLIW)
  – Each instruction word can include up to 6 independent, co-issued operations (5 math + 1 flow control)
  – All operations are performed in parallel on each data element in the current thread
• Texture fetch and vertex fetch instructions are issued and executed separately
  – Allows fetches to begin executing before the requested data is required by the shader
• ALU instruction (1 to 7 64-bit words)
  – 5 scalar ops, 64 bits each for src/dst/controls/op
  – 2 additional words for literal constants
• 5 scalar units
  – Each scalar unit does FP multiply-add (MAD) and integer operations
  – One also handles transcendental instructions (SIN, COS, LOG, EXP, etc.)
  – IEEE 32-bit floating point precision
  – Integer and bitwise operation support
• Branch execution unit
• Up to 6 operations co-issued

• Virtualizes register space
  – Allows overflow to graphics memory
  – Can be read from or written to by any SIMD (texture & vertex caches are read-only)
  – 8 KB fully associative cache, write combining
• Stream out
  – Allows shader output to bypass render back-ends and color buffer
  – Render to vertex buffer
  – Outputs a sequential stream of data instead of bitmaps
• Use: inter-thread communication

• Fetch units
  – 8 fetch address processors each (32 total)
    • 4 filtered and 4 unfiltered
  – 20 texture samplers each (80 total)
    • Each can fetch a single data value per clock
  – 4 filtered texels (with BW) (16 total)
    • Bilinear filter: one 64-bit FP color value per clock for each pixel
    • 128-bit FP textures filtered at half speed
    • Trilinear and anisotropic filtering
• Fetch caches
  – Unified caches across all SIMDs
  – Vertex/unfiltered cache
    • 4 KB L1, 32 KB L2
  – Texture cache
    • 32 KB L1, 256 KB L2 (128 KB for HD 2600; HD 2400 uses a single-level vertex/texture cache)

• Double-rate depth/stencil test
  – 32 pixels per clock for HD 2900
  – 8 pixels per clock for HD 2600 & HD 2400
• Programmable MSAA (multi-sample anti-aliasing) resolve
  – Allows custom AA filters
• New blend-able DX10 surface formats
  – 128-bit and 11:11:10 floating point formats
• Up to 8 Multiple Render Targets (MRT) with MSAA support
• Improved Z & Stencil
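
The counts quoted in these notes can be cross-checked with a little arithmetic. The Python sketch below is only an illustration, not part of the original slides. It assumes that the "Pipelines" row is a per-SIMD count, that every pipeline contains the 5 scalar units described above, and that a SIMD works through a 64-element thread one pipeline-wide slice at a time. All function and constant names are hypothetical.

```python
# Values copied from the notes: (stream processors, SIMDs, pipelines).
CHIPS = {
    "HD 2900": (320, 4, 16),
    "HD 2600": (120, 3, 8),
    "HD 2400": (40, 2, 4),
}

SCALAR_UNITS_PER_PIPELINE = 5   # "5 scalar units" in the notes
THREAD_ELEMENTS = 64            # "threads of 64 elements" in the notes
TEXTURE_UNITS = 4               # "4 Texture Units" bullet (HD 2900)


def stream_processors(simds, pipelines_per_simd):
    """ASSUMPTION: 'Pipelines' is per SIMD, each with 5 scalar units."""
    return simds * pipelines_per_simd * SCALAR_UNITS_PER_PIPELINE


def passes_per_thread(pipelines_per_simd):
    """ASSUMPTION: one pipeline-wide slice of the thread is processed per pass."""
    return THREAD_ELEMENTS // pipelines_per_simd


def max_alu_instruction_bits(scalar_ops=5, literal_words=2, word_bits=64):
    """Largest ALU instruction: 5 op words plus 2 literal words, 64 bits each."""
    return (scalar_ops + literal_words) * word_bits


def texture_totals(per_unit_counts, units=TEXTURE_UNITS):
    """Scale the per-unit fetch resources up to the whole chip."""
    return {name: count * units for name, count in per_unit_counts.items()}


if __name__ == "__main__":
    for name, (listed, simds, pipes) in CHIPS.items():
        derived = stream_processors(simds, pipes)
        print(name, "listed:", listed, "derived:", derived,
              "passes per thread:", passes_per_thread(pipes))
    print("max ALU instruction bits:", max_alu_instruction_bits())
    print(texture_totals({"address processors": 8,
                          "samplers": 20,
                          "filtered texels": 4}))
```

Under these assumptions the derived stream-processor counts agree with the table (320, 120, 40), and the per-unit texture figures scale to the quoted totals of 32, 80, and 16.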
