Introduction to Parallel Rendering Jian Huang CS 594 Spring 2002 Parallel Rendering Graphics rendering process is computationally intensive Parallel computation is a natural measure to leverage for higher performance Two levels of parallelism Functional parallelism pipelining Data parallelism multiple results computed at the same time Data Parallel Algorithms A lot of taxonomies of categorizing parallel algorithms Image space vs object space Shared memory architecture distributed memory architecture MPI OpenMP Need a uniform framework to study and understand parallel rendering A Core Problem To partition work and distribute them Rendering requires the involved data to go with each work partition How to collect the rendered results to form the final image Intensive data communication A landmark paper A sorting classification of parallel rendering Molner et al IEEE CG A 94 The Rendering Process The rendering pipeline Geometry processing transformation lighting clipping Rasterization scan conversion shading visibility Parallel processing A New Perspective Rendering as a sorting process Sort from object coordinates to screen coordinates Use this concept to study computational and communication costs Sorting Parallel geometry processing assign a subset of primitives or objects to each processor Parallel rasterization processing assign a subregion of the screen to each processor The key procedure calculating the effect of each primitive on each pixel Rendering is a sorting process from each primitive into each screen pixel This sort involves redistributing data among processors Where does the sort take place The location of this sort determines the structure of the parallel algorithm This sort can take place during Geometry processing sort first Between geometry processing and rasterization sort middle Rasterization sort last Each different sort has distinct properties Sort First Redistributing raw primitives Sort Middle Redistributing screen primitives Sort Last Redistributing fragments samples or pixels Processing and Communication Model A refined model Assume a dataset containing nr raw primitives with average size ar We will call primitives that result from tessellation display primitives If T is the tessellation ratio there are nd Tnr of these with average size ad ar T If there is no tessellation T 1 nd nr and ad ar Assume an image containing A pixels and need to compute S samples per pixel Assume that all primitives within the viewing frustum The Terms Analysis of Sort First c proportion of primitives to be redistributed Advantages Low communication requirements when the tessellation ratio and the degree of oversampling are high or when frame toframe coherence can be exploited Processors implement entire rendering pipeline for a portion of the screen Disadvantages Susceptible to load imbalance Primitives may clump into regions concentrating the work on a few renderers To take advantage of frame to frame coherence retained mode and complex data handling code are necessary Cost over uni processor rendering Analysis of Sort Middle Advantages General and straightforward redistribution occurs at a natural place in the pipeline Disadvantages High communication costs if tessellation ratio is high Susceptible to load imbalance between rasterizers when primitives are distributed unevenly over the screen Cost over uni processor rendering Analysis of Sort Last Sparse merging only merge the small region each processor renders Full frame merging always merge the whole frame buffer Advantages Renderers implement the full rendering pipeline and are independent until pixel merging Less prone to load imbalance SL full merging can be embedded in a linear network making it linearly scalable Disadvantage Pixel traffic may be extremely high particularly when oversampling Cost over uni processor rendering A Comparison Sort first sort middle and sort last There is no strictly the best category Different categories can be combined in an implementation as well Parallel Volume Rendering A lot of algorithms choose to distribute data to processing nodes Each node renders its portion of data and generate a partial image The partial images then get accumulated together Screen space partitioning tiles or continuous scan lines are also used Load Balancing For better load balancing Task queuing the task queue can be ordered in decreasing task size such that the concurrency gets finer until the queue is exhausted Load stealing having nodes steal smaller tasks from other nodes once they have completed their own tasks Time stamp timeout stamps used for each task such that if the node can not finish its task before the timeout it takes the remnant of the task re partitions it and re distributes it Hierarchical data structures such as octree k d tree etc are commonly used Sum It Up Parallelism is just a tool It depends on which algorithm you parallelize What about an OpenGL application utilizing occlusion culling
View Full Document
Unlocking...