
EE482C: Advanced Computer Organization, Lecture #14
Stream Processor Architecture
Stanford University, Tuesday, 21st May 2002

Graphics Papers

Lecture #14: Tuesday, 21st May 2002
Lecturer: Prof. Bill Dally
Scribe: Timothy Knight, Jung Ho Ahn
Reviewer: Mattan Erez

1 Comparison of Different Architectural Approaches

The following tables describe the techniques used to handle parallelism, hide memory latency, and gain sufficient memory bandwidth, along with the design and programming complexity, for the 3 different architectural approaches discussed in class.

Parallelism

  Vectors:
    Data level parallelism at the operation level.

  Multithreading:
    Thread level parallelism.
    • Synchronization costs are high.
    • Instruction cost - instruction fetches per computation.
    • More flexible (MIMD vs. SIMD).

  Streams:
    Data level parallelism at the kernel level.
    • Gives better reuse than data parallelism at the operation level.

Table 1: Comparison of Different Architectural Approaches

Memory Latency

  Vectors:
    Hides latency by:
    • Overlapping memory access with computation.
    • Amortizing the latency cost over the length of the vector.

  Multithreading:
    Hides latency by:
    • Switching threads - do something else while waiting.
    • Replicating state to allow fast context switching.
    Enables data-dependent pointer chasing, since vectors and streams need to know the memory addresses ahead of time in order to amortize latency.

  Streams:
    Hides latency by:
    • Overlapping memory access with computation.
    • Amortizing the latency cost over the length of the stream.
    • Exploiting producer-consumer locality.

Bandwidth

  Vectors:
    Need a very high bandwidth memory system.

  Multithreading:
    No advantages over conventional processors. Threads compete for the cache.
    • Can get interference, or synergistic sharing.

  Streams:
    Reduces demand with producer-consumer locality. Some of this advantage can be had in a conventional machine using a cache, but a stream machine has a better bandwidth hierarchy, illustrated in Figure 1.

Table 2: Comparison of Different Architectural Approaches (cont.)

Design Complexity

  Vectors:
    Simplest to design.

  Multithreading:
    High complexity:
    • Multiple instruction.
    • Synchronization.
    • etc.

  Streams:
    Moderate complexity:
    • Microcontroller.
    • LRFs.

Program. Complexity

  Vectors:
    Easy - vector ops.

  Multithreading:
    Familiar but difficult:
    • Synchronization.
    • ‘Thread-safe’ code.

  Streams:
    Need to make some things explicit, such as global references.

Table 3: Comparison of Different Architectural Approaches (cont.)

[Figure 1 diagram. Stream architecture levels: Memory, SRF, LRF, ALUs (extracted bandwidth labels: "115250"). Conventional architecture levels: Memory, Cache, Registers, ALUs (extracted bandwidth labels: "14144").]

Figure 1: Bandwidth Hierarchies: Imagine vs. Pentium-4

2 NVIDIA Graphics Paper

The NVIDIA paper ‘A User-Programmable Vertex Engine’ by Lindholm et al. was discussed.

• It presents an introduction of stream processing into a fixed function pipeline.
• They removed a fixed stage and replaced it with a ‘stream processor’ without a memory system - basically, a cluster.
• The question was raised: Are graphics chips becoming stream processors? Because they are in a price sensitive market, they will most likely never become truly general purpose, but rather become multithreaded SIMD processors. But they are trending in that direction.
• Will a graphics processor ever be used to do general purpose computation? Maybe for niche applications; there are difficulties with making random memory accesses. More likely is a processor extension which is stream-like - a platform needs to be standardized so that software vendors can target it.

3 Purcell Graphics Paper

The paper ‘Ray Tracing on Programmable Graphics Hardware’ by Purcell et al. was discussed.

• They’re doing random memory accesses through the texture cache.
• They brought up the issue of multipath vs. conditional loop. They want branching for conditional loops and not for arbitrary conditionals, in order to achieve efficiency.
• They made a very high level simulator of a new (non-existing) graphics chip to perform their experiments. This is a pretty efficient way to work, because no processor exists right now which satisfies their architectural necessities. They took performance values from a current NVIDIA chip and extrapolated those for what they wanted.
• Do they need an SRF? There is locality, so if they could keep the state of rays around, less memory bandwidth would be needed.
• GPUs (graphics processing units) are becoming more general, but are still essentially dedicated graphics chips.

4 Other Comments

• For general storage units, the bigger the slower.
• Some pitfalls in Imagine limit its use as a graphics processor; specifically, it has fewer arithmetic units per unit area than specialized graphics chips.
• Pin packaging trends and issues were introduced - the number of pins on a package is increasing much more slowly than the number of transistors in a chip. Pins can be time-multiplexed at a very high frequency to attain a greater effective bandwidth into and out of the
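
As a rough illustration of the producer-consumer locality point in the comparison tables of Section 1, the sketch below counts words moved to and from memory for the same three-step computation done two ways. It is a toy model, not Imagine's programming interface: the function names, the three arithmetic steps, and the assumption that every operation-level intermediate is written back to memory (that is, it does not fit in vector registers) are all hypothetical choices made for the example.

```python
# Toy memory-traffic model. All names and the choice of operations are
# hypothetical; the model assumes operation-level intermediates spill to memory.

def operation_level(xs):
    """Apply one operation at a time over the whole array (vector style).

    Each operation reads its input array from memory and writes its
    result array back, so intermediates t and u cost memory traffic.
    """
    mem_words = 0

    t = [x * 2 for x in xs]          # op 1: read xs, write t
    mem_words += len(xs) + len(t)

    u = [v + 1 for v in t]           # op 2: read t, write u
    mem_words += len(t) + len(u)

    y = [v * v for v in u]           # op 3: read u, write y
    mem_words += len(u) + len(y)

    return y, mem_words


def kernel_level(xs):
    """Apply all three operations per element inside one kernel (stream style).

    The intermediates t and u live only in local variables (standing in
    for local register files), so only the input and output streams
    touch memory.
    """
    y = []
    for x in xs:
        t = x * 2                    # produced and consumed locally
        u = t + 1                    # produced and consumed locally
        y.append(u * u)
    mem_words = len(xs) + len(y)     # read input stream, write output stream
    return y, mem_words


if __name__ == "__main__":
    data = list(range(1000))
    y_op, words_op = operation_level(data)
    y_k, words_k = kernel_level(data)
    assert y_op == y_k               # same results either way
    print("operation-level memory words:", words_op)
    print("kernel-level memory words:   ", words_k)
```

Under these assumptions, 1000 input elements cost 6000 memory words at the operation level and 2000 at the kernel level, because the two intermediate arrays never leave the kernel. A real vector machine can recover part of this gap with vector registers and chaining, so the 3x figure reflects only this toy model.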

