GT CS 4803 - Radeon HD 2000 Series
School name: Georgia Tech
Pages: 41


Spring 2010, Prof. Hyesoon Kim
AMD presentations from Richard Huddy and Michael Doggett

                          Radeon 2900       Radeon 2600       Radeon 2400
Stream processors         320               120               40
SIMDs                     4                 3                 2
Pipelines                 16                8                 4
Texture units             16                8                 4
Render back-ends          16                4                 4
L2 texture cache (KB)     256               128               0
Technology (nm)           80                65                65
Area (mm2)                420               153               82
Transistors (millions)    720               390               180
Memory bandwidth (bits)   512               128               64
Optimized for             High clock speed  Power efficiency  Power efficiency

- 320 stream processing units
- 4 SIMDs
- 4 texture units
- 4 render back-ends

- GPU interface with host
- Processes command stream from graphics driver
- A custom RISC-based micro-coded engine
- First-class memory client with read/write access
- State management

- Prepares data for processing by the stream processing units
- 3 groups of blocks feeding 3 data streams
  - Each group feeding 16 elements
- Vertex blocks
  - Primitive tessellation
  - Inputs index, instancing
  - Sends vertex addresses to shader core
- Geometry blocks
  - Uses on/off-chip staging
  - Sends processed vertex addresses, near-neighbor addresses and topological information
- Pixel blocks
  - Scan conversion
  - Triangle setup
  - Rasterization and interpolation
  - Interfaces to depth to perform HiZ/EarlyZ checks

- Main control for the shader core
- Separate command queues for each shader type
- Each thread consists of a number of instructions that will operate on a block of input data
  - All workloads have threads of 64 elements
  - 100s of threads in flight
  - Threads are put to sleep when they request a slow-responding resource

- Initial arbiter to select which thread to submit
- Two arbiter units per SIMD array
  - Allows each SIMD to be pipelined with two operations at a time in process
- Dedicated arbiter units for texture and vertex fetches
  - Can be scheduled independently from math operations
- Executing threads can be bumped at any time if a higher-priority thread is pulled from the command queues
  - Temporary data saved so thread can resume later
- Arbitration policy
  - Age, need, availability
  - When in doubt, favor pixels
  - Programmable

- Dedicated shader caches
  - Instruction cache allows unlimited shader length
  - Constant cache allows unlimited number of constants
  - Both caches take advantage of data re-use to improve state change overhead and efficiency
- Latency hiding
  - Cache miss switches to another thread
  - Suspended threads remain in the command queues until their requested data arrives
  - Ultra-threaded dispatch processor can queue up hundreds of threads

- 4 parallel SIMD units
  - Each unit receives an independent ALU instruction
- Very Long Instruction Word (VLIW)
  - Each instruction word can include up to 6 independent co-issued operations (5 math + 1 flow control)
  - All operations are performed in parallel on each data element in the current thread
- Texture fetch and vertex fetch instructions are issued and executed separately
  - Allows fetches to begin executing before the requested data is required by the shader
- ALU instruction: 1 to 7 64-bit words
  - 5 scalar ops: 64 bits each for src/dst/controls/op
  - 2 additional words for literal constants
- 5 scalar units
  - Each scalar unit does FP multiply-add (MAD) and integer operations
  - One also handles transcendental instructions (SIN, COS, LOG, EXP, etc.)
  - IEEE 32-bit floating point precision
  - Integer and bitwise operation support
- Branch execution unit
- Up to 6 operations co-issued

- Virtualizes register space
  - Allows overflow to graphics memory
- Can be read from or written to by any SIMD (texture and vertex caches are read-only)
- 8 KB fully associative cache, write combining
- Stream out
  - Allows shader output to bypass render back-ends and color buffer
  - Render to vertex buffer
  - Outputs sequential stream of data instead of bitmaps
- Uses
  - Used for inter-thread communication

- Fetch units
  - 8 fetch address processors each (32 total)
    - 4 filtered and unfiltered
  - 20 texture samplers each (80 total)
    - Can fetch a single data value per clock
  - 4 filtered texels with BW (16 total)
    - Bilinear filter one 64-bit FP color value per clock for each pixel
    - 128-bit FP textures filtered at half speed
    - Trilinear and anisotropic filtering
- Fetch caches
  - Unified caches across all SIMDs
  - Vertex/unfiltered cache: 4 KB L1, 32 KB L2
  - Texture cache: 32 KB L1, 256 KB L2 (128 KB for HD 2600; HD 2400 uses a single-level vertex/texture cache)

- Double-rate depth/stencil test
  - 32 pixels per clock for HD 2900
  - 8 pixels per clock for HD 2600 and HD 2400
- Programmable MSAA (multi-sample anti-aliasing) resolve
  - Allows custom AA filters
- New blendable DX10 surface formats
  - 128-bit and 11:11:10 floating point formats
- Up to 8 Multiple Render Targets (MRT) with MSAA support

- Improved Z/stencil compression
  - Up to 16:1 in standard mode
  - Z and stencil now compressed separately from each other for better efficiency
- Z range optimization
  - Limits depth test operations to a programmable depth range; useful for speeding up stencil shadowing
- Re-Z
  - Can check Z buffer twice: once before the pixel shader and again after
  - Allows early Z before shading in all cases
- Improved hierarchical Z buffer
  - Adds hierarchical stencil (HiS) for better stencil shadow performance
  - Handles most situations where it had to be disabled in the past
- 32-bit floating point Z buffer support

- Centralized: crossbar
  - ATI Radeon X850 and earlier
  - All computing GPUs
- Partially distributed: hybrid ring bus
  - ATI Radeon X1000 series
- Fully distributed: ring bus
  - ATI Radeon HD 2000 series

- Over 100 GB/s memory bandwidth
- Fully distributed design
  - Highly scalable
- 512-bit interface
  - Compact, stacked I/O pad design
  - More bandwidth with existing memory technology
  - Improved cost:bandwidth ratio
- 8x64-bit memory channels
- Double ring bus
  - 512-bit read and write

- Benefits of a 512-bit interface
  - More bandwidth with existing memory technology
  - Lower memory clock required to
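
The slides quote over 100 GB/s from a 512-bit interface built from 8x64-bit channels. The arithmetic behind that kind of claim can be sketched as below: peak bandwidth is the interface width in bytes times the effective transfer rate. The function names and the 256-bit comparison bus are assumptions made for illustration; they do not come from the slides, and no actual memory clock is implied.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a given interface width."""
    return bus_width_bits / 8 * transfers_per_s / 1e9


def required_transfer_rate(bus_width_bits: int, target_gb_s: float) -> float:
    """Effective transfers per second needed to reach target_gb_s."""
    return target_gb_s * 1e9 / (bus_width_bits / 8)


# 8 channels x 64 bits = 512-bit interface, as stated in the slides.
BUS_BITS = 8 * 64

# Transfer rate needed to hit 100 GB/s on the 512-bit interface.
rate_512 = required_transfer_rate(BUS_BITS, 100.0)

# Hypothetical 256-bit bus (not from the slides) for comparison.
rate_256 = required_transfer_rate(256, 100.0)

print(f"512-bit: {rate_512 / 1e9} GT/s, 256-bit: {rate_256 / 1e9} GT/s")
```

Under these assumptions the wider interface needs half the transfer rate for the same bandwidth, which is one way to read "more bandwidth with existing memory technology".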

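The specification figures at the start of the slides appear to be consistent with one another: the stream processor count matches SIMDs x pipelines x 5 scalar units for all three parts. That relationship is inferred from the numbers, not stated in the slides, so the sketch below should be read as a sanity check under that assumption. The dictionary layout and function name are illustrative only.

```python
# Figures copied from the specification table in the slides.
SPECS = {
    "Radeon 2900": {"stream_processors": 320, "simds": 4, "pipelines": 16},
    "Radeon 2600": {"stream_processors": 120, "simds": 3, "pipelines": 8},
    "Radeon 2400": {"stream_processors": 40, "simds": 2, "pipelines": 4},
}

# The slides list 5 scalar units; assumed here to be per pipeline.
SCALAR_UNITS = 5


def derived_stream_processors(simds: int, pipelines: int,
                              scalar_units: int = SCALAR_UNITS) -> int:
    """Stream processors implied by SIMDs x pipelines x scalar units."""
    return simds * pipelines * scalar_units


for name, spec in SPECS.items():
    derived = derived_stream_processors(spec["simds"], spec["pipelines"])
    matches = derived == spec["stream_processors"]
    print(f"{name}: derived={derived}, listed={spec['stream_processors']}, match={matches}")
```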

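The latency-hiding idea in the dispatch slides (threads sleep while waiting on a slow resource, and the arbiter runs another thread instead) can be illustrated with a toy model. This is a simplified sketch, not the hardware's algorithm: it models a single ALU, uses age as the only arbitration criterion (the slides also list need and availability), and the cycle counts are made-up parameters.

```python
def alu_utilization(num_threads: int, fetch_latency: int,
                    alu_cycles: int, total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which the single modeled ALU had a thread to run.

    Each thread repeats: alu_cycles of math, then a fetch that puts it to
    sleep for fetch_latency cycles. All values are hypothetical.
    """
    ready_at = [0] * num_threads             # first cycle each thread may run again
    work_left = [alu_cycles] * num_threads   # math cycles left before the next fetch
    busy = 0
    for cycle in range(total_cycles):
        # Age-based arbitration: the lowest index is treated as the oldest thread.
        ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if not ready:
            continue  # every thread is asleep waiting on a fetch
        t = ready[0]
        busy += 1
        work_left[t] -= 1
        if work_left[t] == 0:
            # The thread issues a fetch and sleeps until its data arrives.
            ready_at[t] = cycle + 1 + fetch_latency
            work_left[t] = alu_cycles
    return busy / total_cycles


for n in (1, 2, 5, 8):
    print(n, alu_utilization(n, fetch_latency=16, alu_cycles=4))
```

With these made-up parameters, a single thread keeps the ALU busy about 20% of the time, while five or more threads in flight keep it fully busy. The model ignores fairness: with strict age priority, the youngest threads may never run once the ALU is saturated.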