Finite Element Simulation of Nonlinear Elastic Dynamics Using CUDA

Christopher Cameron
May 10, 2009

The Problem

Compute how an elastic object deforms over time when subjected to external forces.
• Many applications, including stress-testing buildings and vehicles.
• Artificially constrain the implementation to fit on one GPU (small-scale simulations still have applications, e.g., surgical simulation).

The Discrete Time Differential Equation

Given the inputs
• A discretized domain $\Omega$ with $n$ nodes and $e$ elements
• A time-varying vector $q(t) \in \mathbb{R}^{3n}$ that describes the displacement of the nodes of the mesh
• A mass matrix $M \in \mathbb{R}^{3n \times 3n}$ describing the mass distribution in the body
• An internal force function resulting from deformation, $f_{int}: \mathbb{R}^{3n} \to \mathbb{R}^{3n}$, and its derivative $f'_{int}: \mathbb{R}^{3n} \to \mathbb{R}^{3n \times 3n}$
• A time-varying external force function from user input, $f_{ext}(t) \in \mathbb{R}^{3n}$

Solve the differential equation (looks like Newton's second law)

$$M \ddot{q}(t) = f_{int}(q(t)) + f_{ext}(t)$$

Solution Details

Running a simulation consists of, for each time value $t_i$ in $t_1, \ldots, t_m$, computing $q(t_{i+1})$ from $q(t_i)$. This involves
1. For each element, compute $f_{int}$ and $f'_{int}$ for just that element.
  – For the 8-node brick elements used, $f_{int} \in \mathbb{R}^{24}$ and $f'_{int} \in \mathbb{R}^{24 \times 24}$
2. Assemble the per-element $f_{int}$ and $f'_{int}$ into whole-mesh $f_{int}$ and $f'_{int}$
3. Solve a sparse symmetric positive definite linear system involving $f_{int}$, $f'_{int}$, and $M$

We focus on step 1 in this project.
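As a minimal sketch of what step 1 looks like, assume a much simpler element than the 8-node brick: a hypothetical two-node 1D spring with a cubic hardening law. The names `element_force_and_tangent`, `k`, and `beta` are illustrative and not from the project, and Python is used for brevity rather than CUDA. The point is only that each element's force and derivative depend on that element's own node displacements, so every iteration of the loop is independent of the others.

```python
def element_force_and_tangent(u_a, u_b, k=1.0, beta=0.5):
    """Internal force (length 2) and its derivative (2 x 2) for one element.

    Assumed toy law: tension = k * (s + beta * s**3), with stretch s = u_b - u_a.
    Sign convention follows M q'' = f_int + f_ext, so f_int is a restoring force.
    """
    s = u_b - u_a
    tension = k * (s + beta * s ** 3)
    k_t = k * (1.0 + 3.0 * beta * s ** 2)  # d(tension)/ds

    # A stretched element pulls node a toward b and node b toward a.
    f = [tension, -tension]
    # Derivative of f with respect to (u_a, u_b).
    df = [[-k_t, k_t],
          [k_t, -k_t]]
    return f, df


def compute_all_elements(q, elements):
    """Step 1 only: per-element results, with no assembly.

    Each iteration reads only its own element's nodes, which is what
    allows one GPU thread per element.
    """
    return [element_force_and_tangent(q[a], q[b]) for a, b in elements]


# Three nodes on a line, two elements connecting neighbours.
q = [0.0, 0.1, 0.3]
elements = [(0, 1), (1, 2)]
results = compute_all_elements(q, elements)
```

For the brick elements, the same pattern would return a 24-vector and a 24 × 24 matrix per element instead of the 2-vector and 2 × 2 matrix here.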
The remaining steps are very common and well studied.

Implementation Details

• Implemented on a GeForce 8800 GT with 512 MB of memory.
• Only uses single-precision floating point.
  – GPUs with double precision are available (just expensive).
  – Single precision has very severe stability implications.
• One thread per element.
  – No communication between elements is necessary until the assembly stage.
  – More threads in flight means more opportunity to hide latency.
  – One thread per quadrature point is possible, but it is more complicated and results in more communication or over-computation.
• Use texture to read thread input values (e.g., node positions and displacements).
  – Texture has a cache to lower latency.
  – Using texture removes the need for coalesced reads.

Future Work

• Exploit capabilities of newer hardware (e.g., GeForce GTX 280).
  – Double precision support is available now.
  – Coalescing constraints on reads and writes have been relaxed.
    ∗ No need to use texture for reading.
    ∗ Potentially merge the per-element computation step with the assembly step.
• Full end-to-end solution on the GPU
  – Perform assembly and the conjugate gradient (CG) solve on the GPU.
  – Should be much faster due to
    ∗ No data transfer between CPU and GPU
    ∗ A GPU implementation of the CG solve should be faster than the CPU one
• Extend to larger problems
  – Multi-GPU and multi-system implementations.
  – Double precision support allows larger systems to be stable.

The