UT CS 378 - Fast Multipole Methods on Graphics Processors - D2088049

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 378> Fast Multipole Methods on Graphics Processors

DOC PREVIEW

UT CS 378 - Fast Multipole Methods on Graphics Processors

School name University of Texas at Austin

Course Cs 378- Declarative Programming

Pages 30

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Fast Multipole Methods on Graphics ProcessorsNail A. Gumerov and Ramani DuraiswamiPerceptual Interfaces and Reality Lab.Computer Science and UMIACSUniversity of Maryland, College Park{gumerov,ramani}@umiacs.umd.eduFantalgo, LLC, Elkridge [email protected] www.fantalgo.com21 Nov 2007AbstractThe Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered atpoints distributed inside a computational domain at a large number of evaluation points to a specifiedaccuracy . The method scales as O (N) compared to the direct method with complexity O(N2), whichallows one to solve larger scale problems. Graphical processing units (GPU) are now increasingly viewedas data parallel compute coprocessors that can provide significant computational performance at lowprice. We describe acceleration of the FMM using the data parallel GPU architecture.The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We described strategies for parallelization of all components of the FMM, developa model to explain the performance of the algorithm on the GPU architectures, and determined optimalsettings for the FMM on the GPU, which are different from those on usual CPUs. Some innovations in theFMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplacekernel, and decompositions of the translation operators, are also described.We obtained accelerations of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPUin the range 30-60 compared to a serial CPU implementation for benchmark cases of up to million size.For a problem with a million sources, the summations involved are performed in approximately onesecond. This performance is equivalent to solving of the same problem at 24-43 Teraflop rate if we usestraightforward summation.1 IntroductionFast Multipole Methods (FMM) were invented in the late 1980s [6], and have since been identified as oneof the ten most important algorithmic contributions of the 20th century [4]. The FMM is widely used inmany areas because of its ability to achieve linear time and memory dense matrix vector products to a fixedprescribed accuracy  for problems arising in diverse areas (molecular dynamics, astrophysics, acoustics,fluid mechanics, electromagnetics, scattered data interpolation etc.).Graphics Processing Units: A recent hardware trend is for the development of highly capable data-parallel processors, with their origin in the gaming and graphics industries. In 2007, while the fastest IntelCPU can only achieve ~20 Gflops speeds on benchmarks, the GPUs have speeds that are more than an orderof magnitude higher. Of course, the GPU performs highly specialized tasks, while the CPU is a generalpurpose processor. Fig. 1 shows the relative abilities of GPUs and CPUs (on separate benchmarks) in early2006. The trends reported are expected to continue for the next few years.1Figure 1: GPU and CPU growth in speed over the last 6 years.While GPUs were originally intended for specialized graphics operations, it is often possible to performgeneral purpose computations on them, and this has been a vigorous area of research over the last few years.One of the problems that a number of researchers have tackled for speed up is the so called N body problem,in which the Coulombic (or gravitational) potentials and/or forces exerted by all charges (particles) in asystem on each other are evaluated. This force calculation is part of a time stepping procedure in whichthe accelerations are evaluated, while the potential is necessary for energy computations. Recently severalresearchers have reported the use of GPUs, either in isolation or in a cluster to speed up these computationsusing direct algorithms, in which the interaction of every pair of particles is considered (see e.g., [20, 19, 16]).While impressive speedups are reported, these algorithms suffer from the ON2nature of the algorithm.Efforts in Japan, under the auspices of the Grape project, have sought to develop special hardware, similarto the GPU, for performing N-body calculations as well (see e.g., [22, 23]). The substantial investmentsrequired to create special purpose hardware have meant that the progress in new generations has been slow(about 7 years since the previous Grape version was introduced). In contrast, the market driven developmentof the GPUs has lead to yearly introduction of new and improved hardware.Pre sent contribution: We map the complex FMM algorithm on to the GPU architecture. The idea toparallelize the FMM is not new and there are publications which address some basic issues [7, 13, 14, 15, 17].However these publications are related to development of more coarse-grained parallel algorithms. Wedeveloped algorithms that work with the nonuniform memory access costs for the GPU global and localmemory, and a data parallel architecture. We used the hierarchical nature of the FMM to balance costtrees, and achieve optimal performance. This property of the FMM is very important as the different partsof the FMM can be accelerated with different degrees of efficiency, and at optimal settings, the overallperformance is controlled not only by the slowest step of algorithm. Finally, our implementation on the GPUalso suggested several improvements to the CPU FMM algorithm, which are also described.We implemented all the parts of the FMM algorithm on the GPU and obtained a properly scaled algorithmthat requires times of the order of 1 s to compute the matrix vector product for a million sized problem on acurrent NVIDIA GeForce 8800GTX GPU. This corresponds to effective flop rates of 24-43 Tflops, followingthe formula given in [21, 19].22 Fas t multipole method (FMM)While several papers and books describing the FMM are available, to keep the paper self contained andestablish notation we provide a brief introduction here. The basic task that the FMM performs is computationof potentials and its gradients (forces) generated by a set of sources and evaluated at a set of evaluation points(receivers)φj= φ (yj) =N i=1Φ (yj, xi) qi, j = 1, ..., M, xi, yj∈ Rd, (1)where {xi} are the sources, {yj} the receivers, qistrengths, and d is the dimensionality of the problem. Itcan also be used to compute the gradient∇φj= ∇yφ (y)|y=yj=N i=1∇yΦ (y, xi)|y=yjqi. (2)Obvious examples of such problems include particle interaction, star motion, and molecular dynamics. Atthe same time this

View Full Document