ECE 498AL Lecture 18: Performance Case Studies: Ion Placement Tool, VMD
Guest Lecture by John Stone
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Slide titles:
• Objective
• Molecular Modeling: Ion Placement
• Evolution of Ion Placement Code
• Ion Placement Algorithm
• Computational Profile of the Algorithm
• Coulombic Potential Map Slice: Simplest C Version (GFLOPS? Don’t ask…)
• Algorithm Design Observations
• Observations and Challenges for GPU Implementation
• Plan for CUDA Coulombic Potential Map Code
• CUDA Block/Grid Decomposition
• Version 1: Tex Memory (90 GFLOPS, 9 Billion Atom Evals/Sec)
• Version 1 Inner Loop Structure
• Version 2: Const+Precalc (150 GFLOPS, 16.7 Billion Atom Evals/Sec)
• Version 2: Kernel Structure
• Version 3: Const+Precalc+Loop Unrolling (226 GFLOPS, 33 Billion Atom Evals/Sec)
• Version 3: Inner Loop
• Version 4: Const+Shared+Loop Unrolling+Precalc (235 GFLOPS, 34.8 Billion Atom Evals/Sec)
• Version 4: Kernel Structure
• Calculating Potential Maps in Parallel
• Parallel GPUs with Multithreading: 705 GFLOPS w/ 3 GPUs
• Multi-GPU CUDA Coulombic Potential Map Performance
• Never Trust Compilers (With apologies to Wen-mei and David)
• Early Experiences Integrating CUDA Kernels Into VMD
• VMD/CUDA Integration Observations
• VMD/CUDA Integration Observations (2)
• VMD/CUDA Resource Management
• VMD/CUDA Resource Management (2)
• VMD/CUDA Code Organization
• VMD/CUDA Code Organization (2)

Objective
• To learn design, implementation, and testing strategies for GPU acceleration of existing software using CUDA
  – Identify performance-critical software modules
  – Decompose identified modules into kernels which may benefit from GPU acceleration
  – Detailed examination of Coulombic potential map code
  – Abstract the implementation of the computational kernel so that the caller need not worry about the low-level details
  – Software structures supporting GPU acceleration

Molecular Modeling: Ion Placement
• Biomolecular simulations attempt to replicate in vivo conditions in silico
• Model structures are initially constructed in vacuum
• Solvent (water) and ions are added as necessary for the required biological conditions
• Computational requirements scale with the size of the simulated structure

Evolution of Ion Placement Code
• First implementation was sequential
• Repeated scientific methodological revisions improved results
• As the size of simulated structures increased, the performance of the code became much more important
• Virus structure with 10^6 atoms would require 10 CPU days
• Tuned for Intel C/C++ vectorization + SSE: ~20x speedup
• Parallelized w/ pthreads: high data parallelism = linear speedup
• Parallelized GPU accelerated implementation: three GeForce 8800GTX cards outrun ~300 CPUs!
• Virus structure now runs in 25 seconds on 3 GPUs!
• Seems impossible until one considers how much faster GPUs are for graphics than a general purpose CPU…
• Further speedups should still be possible…

Ion Placement Algorithm
• Calculate initial Coulombic electrostatic potential map around the simulated structure:
  – For each voxel, sum potential contributions for all atoms in the simulated structure: potential += charge[i] / (distance to atom[i])
• Place ions one at a time:
  – Find the voxel containing the minimum potential value
  – Add a new ion atom centered on the minimum voxel position
  – Update the potential map, adding the potential contribution of the newly placed ion
  – Repeat until the required number of ions have been added

Computational Profile of the Algorithm
• Over 99% of the run time of the algorithm is consumed in the initial potential map calculation, since the number of ions is always tiny compared to the size of the simulated system
• Direct summation of electrostatic potentials is “safe” in terms of numerical accuracy, and is highly data parallel
• Interesting GPU test case, since Coulombic potential maps are useful for many other calculations
• Faster approximation algorithms currently in testing…

Coulombic Potential Map Slice: Simplest C Version
GFLOPS? Don’t ask…

    // dim3 is the CUDA vector type; grid.x and grid.y give the slice dimensions in voxels.
    // atoms holds numatoms records of 4 floats each: x, y, z, charge.
    void cenergy(float *energygrid, dim3 grid, float gridspacing, float z,
                 const float *atoms, int numatoms) {
      int i, j, n;
      int atomarrdim = numatoms * 4;
      // The slide uses k without defining it; deriving the slice index
      // from z is an assumption made here so that the listing compiles.
      int k = (int) (z / gridspacing);
      for (j=0; j<grid.y; j++) {
        float y = gridspacing * (float) j;
        for (i=0; i<grid.x; i++) {
          float x = gridspacing * (float) i;
          float energy = 0.0f;
          for (n=0; n<atomarrdim; n+=4) { // calculate potential contribution of each atom
            float dx = x - atoms[n  ];
            float dy = y - atoms[n+1];
            float dz = z - atoms[n+2];
            energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
          }
          energygrid[grid.x*grid.y*k + grid.x*j + i] = energy;
        }
      }
    }

Algorithm Design Observations
• Voxel coordinates are generated on-the-fly
• Atom coordinates are translated to the map origin in advance, eliminating redundant work
• Ion placement maps require ~20 potential voxels/atom
• Atom list has the smallest memory footprint, making it the best choice for the inner loop (both CPU and GPU)
• Arithmetic can be reduced by creating a new atom list containing X, Q, and a precalculation of dy^2 + dz^2, updated for each row (CPU)

Observations and Challenges for GPU Implementation
• The innermost loop will consume operands VERY quickly
• Straightforward implementation has a low ratio of floating point arithmetic operations to memory transactions (for a GPU)
• Since atoms are read-only during the calculation, they are ideal candidates for texture memory or const memory
• GPU implementation must avoid bank

