UMD CMSC 828V - Hardware-Aware Analysis and Optimization of Stable Fluids - D418915

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 828V> Hardware-Aware Analysis and Optimization of Stable Fluids

UMD CMSC 828V - Hardware-Aware Analysis and Optimization of Stable Fluids

School name University of Maryland, College Park

Course Cmsc 828v- Advanced Topics in Information Processing

Pages 25

Download Save

Unformatted text preview:

Hardware-Aware Analysis and Optimization of Stable FluidsOptimization of Stable FluidsPresentation Date: Sep 15th2009Chrissie C. CuiOutline• Introduction• Highlights• Flop and Bandwidth Analysis•MehrstellenSchemes•MehrstellenSchemes• Advection Caching• Conclusions and Future WorkIntroduction• Implement Jos Stam’s Stable Fluids fluid simulation algorithm on CPU, GPU and Cell• Detailed Flop and Bandwidth Analysis for each computational stage and each implementationcomputational stage and each implementation• Propose new schemes to solver the processor idle problem and the performance loss caused by random memory accessHighlights• Stam’s fluid algorithm is bandwidth bounded• The cores sitting idle up to 96% of the time!!• Detailed Flops and Bandwidth analysis of each computational stage computational stage • Make use of otherwise idle processors by using higher order Mehrstellen methods• Adopt a simple static caching scheme to beat the performance barrier in the semi -Lagrangian advection step (99% hit rate ~~ this is very impressive)Flops & Bandwidth Analysis - Overview• Pre-assumptions:– Ideal cache with maximum data reuse– Three rows of the computational grid fit in cache• Hardware:– CPU: SSE4 - Intel Xeon 5100 (Woodcrest)–GPU: NvidiaGeforce8800 Ultra, Geforce7900–GPU: NvidiaGeforce8800 Ultra, Geforce7900– Cell: IBM Cell QS20• Code version:– CPU: Stam’s open source implementation– GPU: Harris implementation– Cell: Theodore Kim implementation• Multiply – add is counted as one operation on all the architectures.• The following example is in 2D. The 3D results could be obtained by applying the same analysisFlops & Bandwidth Analysis – Add Source• Source Code:•Analysis:•Analysis:– Within each loop: 2 loads + 1 store– 2 Scalar components of velocity and the single density field• Flops: 3N2• Bandwidth: 9N2Flops & Bandwidth Analysis - Diffusion• Source Code:• Analysis:–Perform I iterations on each grid cell–Perform I iterations on each grid cell– 1 store for x[i][j], 1 store for x0[i][j]– Under ideal cache assumption: 1 load only for x[i][j+1]– 3 adds and 1 multiply-add in each iteration• Flops: (3+ 12I) N2• Bandwidth: (9+ 9I) N2Flops & Bandwidth Analysis – Projection I• Three sub-stages: divergence computation, pressure computation, final projection• Divergence Computation:– Source Code–Analysis–Analysis• 1 store for div[i][j], 1 load for u, 1 load for new row of v• 2 minus, 1 add, 1 multiply– Flops: 3N2– Bandwidth: 4N2• Pressure Computation:– Computed by a linear solver– The same as Divergence Computation– Flops: 3N2– Bandwidth: 4N2Flops & Bandwidth Analysis – Projection II• Final Projection– Source Code– Analysis• Loads and stores for u and v•Loads for p could be amortized into a single load•Loads for p could be amortized into a single load• 1 minus, 1 multiply add per line– Flops: 5N2– Bandwidth: 4N2• Sum up:– Flops: (8 + 4I) N2– Bandwidth: (8+3I) N2Flops & Bandwidth Analysis – Advection I• Three steps: backtrace, grid index computation and interpolation• Backtrace– Source Code– Analysis• 1 multiply add for each line• Loads from u and v– Flops: 2N2– Bandwidth: 2N2• Grid Index Computation– Source CodeFlops & Bandwidth Analysis – Advection II• Grid Index Computation:– Analysis:• The If statements can be stated as ternaries (0 Flops)• and emulate a floor function, 1 flop for eachflop for each• Local variable computation– Flops: 4N2– Bandwidth: 0 • Interpolation– Two steps: weight computation and interpolation computationFlops & Bandwidth Analysis – Advection III• Interpolation– Weights Computation• Source Code• Flops: 4N2• Bandwidth: 0–Interpolation Computation–Interpolation Computation• Source Code• Analysis– 1 load for d[i][j]– No amortize for the loads of d0 (unpredictable access pattern)– With multiply-add 6 flops• Flops: 6N2• Bandwidth: 5N2Flops & Bandwidth Analysis - Summary• To sum up– 2D case• Flops : (56 + 16I)N2•Bandwidth: (38 + 12I)N2•Bandwidth: (38 + 12I)N– 3D case (Extended from 2D)• Flops: (106 + 30I)N3• Bandwidth: (71+15I)N3Peak Performance Estimate I • Hardware specification:– CPU:• Intel Xeon 5100 • Two cores at 3 Ghz• Dispatch a 4-float SIMD instruction each clock cycle. • Peak performance: 24 GFlops/s. • Peak memory bandwidth: 10.66 GB/s– GPU:• Nvidia Geforce 8800 Ultra •128 scalar cores at 1.5 Ghz.•128 scalar cores at 1.5 Ghz.• Peak Performance :192 GFlop/s. • Peak memory bandwidth:103.7 GB/s– Cell:• IBM QS20 Cell blade • Two Cell chips at 3.2 Ghz. • 8 Synergistic Processing Elements (SPEs) per cell• Dispatching 4-float SIMD instructions every clock cycle.• Peak Performance: 204.8 GFlops/s. • Peak memory bandwidth: 25.6 GB/s• Performance Evaluation from the developed equations (Table1. on the next page)Peak Performance Estimate IITable 1: Estimated peak frames per second of Stable Fluids over different resolutions for several architectures. Peak •Performance Estimate• The ratio of computation to data arrival•2D: CPU – 6.65x faster, GPU: 5.47x faster, Cell: 23.66x faster•3D: CPU – 4.47x faster, GPU: 3.89x faster, Cell: 16.8x faster• Processer Idle Rate•2D: CPU – 85%, GPU – 82%, Cell – 96%•3D: CPU – 79%, GPU – 74%, Cell – 94%Table 1: Estimated peak frames per second of Stable Fluids over different resolutions for several architectures. Peak performance is estimated for each architecture assuming the computation is compute-bound (ie infinite bandwidth is available) and bandwidth-bound (i.e. infinite flops are available). The lesser of these two quantities is the more realistic estimate. In all cases, the algorithm is bandwidth-bound.Peak Performance Estimate III• Arithmetic Intensity• When I (Iteration #) goes to infinity?• A reasonable explanation: Algorithms runs well on the Cell and GPU when their arithmetic intensities are much greater than one. As both the 2D and 3D cases are close to one, the available flops will be underutilized.Performance Measurement• Frame RateTable 2: Theoretical peak frames per second (The bandwidth-bound values from Table 1) and actual measured frames per second. None of the measured times exceed the predicted theoretical peaks, validating the finding • Some findings– The predicted theoretical peaks

View Full Document


School:
Email:
New Password:
Confirm Password:

UMD CMSC 828V - Hardware-Aware Analysis and Optimization of Stable Fluids

Sign up for free to view:

Please select your school