GT ECE 4893 - Lecture 8: GPU Architectures
School name Georgia Tech
Pages 20

This preview shows page 1-2-19-20 out of 20 pages.



Lecture 8: GPU Architectures
Prof. Aaron Lanterman
School of Electrical and Computer Engineering
Georgia Institute of Technology

Slide 2: Bandwidth – Gravity of Modern Computer Systems
• The bandwidth between key components ultimately dictates system performance
– Especially true for massively parallel systems processing massive amounts of data
– Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
– Ultimately, the performance falls back to what the “speeds and feeds” dictate
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 6, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 3: Interface “feeds and speeds”
• AGP: Advanced Graphics Port – an interface between the computer core logic and the graphics processor
– AGP 1x: 266 MB/sec – twice as fast as PCI
– AGP 2x: 533 MB/sec
– AGP 4x: 1 GB/sec → AGP 8x: 2 GB/sec
– 256 MB/sec readback from graphics to system
• PCI-E: PCI Express – a faster interface between the computer core logic and the graphics processor
– PCI-E 1.0: 4 GB/sec each way → 8 GB/sec total
– PCI-E 2.0: 8 GB/sec each way → 16 GB/sec total
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 4: 3D Buzzwords
• Fill Rate – how fast the GPU can generate pixels, often a strong predictor for application frame rate
• Performance Metrics
– Mtris/sec - Triangle Rate
– Mverts/sec - Vertex Rate
– Mpixels/sec - Pixel Fill (Write) Rate
– Mtexels/sec - Texture Fill (Read) Rate
– Msamples/sec - Antialiasing Fill (Write) Rate
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 5: Adding Programmability to the Graphics Pipeline
[Figure: programmable graphics pipeline block diagram, with the CPU – GPU boundary marked. Block labels: 3D Application or Game; 3D API: OpenGL or Direct3D; GPU Front End; Programmable Vertex Processor; Primitive Assembly; Rasterization & Interpolation; Programmable Fragment Processor; Raster Operations; Framebuffer. Data-flow labels: 3D API Commands; GPU Command & Data Stream; Vertex Index Stream; Pre-transformed Vertices; Transformed Vertices; Assembled Polygons, Lines, and Points; Pixel Location Stream; Rasterized Pre-transformed Fragments; Transformed Fragments; Pixel Updates.]
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 6: Specialized Instructions (GeForce 6)
• Dot products
• Exponential instructions:
– EXP, EXPP, LOG, LOGP
– LIT (Blinn specular lighting model calculation!)
• Reciprocal instructions:
– RCP (reciprocal)
– RSQ (reciprocal square root!)
• Trigonometric functions
– SIN, COS
• Swizzling (swapping xyzw), write masking (only some xyzw get assigned), and negation are “free”
From GPU Gems 2, p. 484

Slide 7: Easy cross products and normalization
[Figure not included in the text preview.]
From Stanford CS448A: Real-Time Graphics Architectures. See graphics.stanford.edu/courses/cs448a-01-fall

Slide 8: Blinn lighting in one instruction
[Figure not included in the text preview.]
From Stanford CS448A: Real-Time Graphics Architectures. See graphics.stanford.edu/courses/cs448a-01-fall

Slide 9: Simple graphics pipeline
[Figure not included in the text preview.]
From Stanford CS448A: Real-Time Graphics Architectures. See graphics.stanford.edu/courses/cs448a-01-fall

Slide 10: The GeForce Graphics Pipeline
[Figure: GeForce pipeline block diagram. Block labels: Host; Vertex Control; Vertex Cache; VS/T&L; Triangle Setup; Raster; Shader; ROP; FBI; Texture Cache; Frame Buffer Memory.]
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 11: Vertex Cache
• Temporary store for vertices, used to gain higher efficiency
• Re-using vertices between primitives saves AGP/PCI-E bus bandwidth
• Re-using vertices between primitives saves GPU computational resources
• A vertex cache attempts to exploit “commonality” between triangles to generate vertex reuse
• Unfortunately, many applications do not use efficient triangle ordering
[Figure: GeForce pipeline block diagram, same blocks as on slide 10.]
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 12: Texture Cache
• Stores temporally local texel values to reduce bandwidth requirements
• Due to the nature of texture filtering, high degrees of efficiency are possible
• Efficient texture caches can achieve 75% or better hit rates
• Reduces texture (memory) bandwidth by a factor of four for bilinear filtering
[Figure: GeForce pipeline block diagram, same blocks as on slide 10.]
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 13: Built-in Texture Filtering (GeForce 6)
• Pixel texturing
– Hardware supports 2D, 3D, and cube map
– Non power-of-2 textures OK
– Hardware handles addressing and interpolation for you
• Bilinear, trilinear (3D or mipmap), anisotropic
• Vertex texturing
– Vertex processors can access texture memory too
– Only nearest-neighbor filtering supported in G60 hardware

Slide 14: ROP (from Raster Operations)
• C-ROP performs frame buffer blending
– Combinations of colors and transparency
– Antialiasing
– Read/Modify/Write the Color Buffer
• Z-ROP performs the Z operations
– Determine the visible pixels
– Discard the occluded pixels
– Read/Modify/Write the Z-Buffer
• ROP on GeForce also performs
– “Coalescing” of transactions
– Z-Buffer compression/decompression
[Figure: GeForce pipeline block diagram, same blocks as on slide 10.]
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 15: The Frame Buffer
• The primary determinant of graphics performance other than the GPU
• The most expensive component of a graphics product other than the GPU
• Memory bandwidth is the key
• Frame buffer size also determines
– Local texture storage
– Maximum resolutions
– Antialiasing resolution limits
[Figure: GeForce pipeline block diagram, same blocks as on slide 10.]
Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission. See courses.ece.uiuc.edu/ece498/al1

Slide 16: Frame Buffer Interface (FBI)
• Manages reading from and writing to frame buffer
• Perhaps the most performance-critical component of a GPU
• GeForce’s FBI is a crossbar
• Independent memory controllers for 4+ independent memory banks for more efficient
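
The “speeds and feeds” figures quoted on the interface and texture cache slides lend themselves to a quick back-of-envelope check. The Python sketch below is illustrative only and is not part of the original lecture. It estimates best-case transfer times at the quoted peak interface bandwidths and shows why a 75% texture cache hit rate corresponds to a fourfold reduction in memory traffic. The function names, the 64 MB payload, and the convention 1 GB = 1000 MB are assumptions made for the example.

```python
# Back-of-envelope "speeds and feeds" arithmetic using figures quoted on the slides.
# Function names, the payload size, and 1 GB = 1000 MB are illustrative assumptions.

# Peak interface bandwidths in MB/sec, as quoted on the interface slide.
INTERFACE_MB_PER_SEC = {
    "AGP 1x": 266,
    "AGP 2x": 533,
    "AGP 4x": 1000,
    "AGP 8x": 2000,
    "PCI-E 1.0 (one way)": 4000,
    "PCI-E 2.0 (one way)": 8000,
}


def transfer_time_ms(megabytes, mb_per_sec):
    """Best-case time in milliseconds to move `megabytes` at peak bandwidth."""
    return 1000.0 * megabytes / mb_per_sec


def bandwidth_reduction_factor(hit_rate):
    """Factor by which memory traffic shrinks when only cache misses reach memory."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


if __name__ == "__main__":
    payload_mb = 64  # hypothetical amount of vertex or texture data to upload
    for name, bandwidth in INTERFACE_MB_PER_SEC.items():
        time_ms = transfer_time_ms(payload_mb, bandwidth)
        print(f"{name:>20}: {time_ms:7.1f} ms for {payload_mb} MB")

    # With a 75% hit rate, only one texel read in four goes to memory.
    factor = bandwidth_reduction_factor(0.75)
    print(f"75% hit rate: memory traffic reduced by a factor of {factor:g}")
```

These are peak-rate estimates; real transfers see lower throughput. The buffering, reordering, and caching tricks mentioned on slide 2 can hide the cost for a while, but sustained performance returns to these limits.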

