Computation of Pi using CUDA
Dan Padgett
University at Buffalo
12/17/2009

Background
● Want to find a way of utilizing CUDA to help improve times for computing digits of pi
● First attempt used numerical integration
  – Proved to be unhelpful
  – Rate of convergence

Obstacles
● Original series converged too slowly
● Only double precision supported under CUDA compute capability 1.3

Overcoming our Obstacles
● Found new series with fast convergence

$$\pi = \sum_{i=0}^{\infty} \frac{1}{16^{i}} \left( \frac{4}{8i+1} - \frac{2}{8i+4} - \frac{1}{8i+5} - \frac{1}{8i+6} \right)$$

Next Steps of Action
● Implemented new series
  – Sum converged to full precision in 8 iterations
  – Looked for higher precision library
● Why has no one written one for CUDA?
● We will soon find out...

Implementing Higher Precision
● Started with sequential C
● Modeled after IEEE 754 floating point specs
● Left precision as a #define variable
● Was able to compute precisions up to 2600 integers per number on a worker node

Stop... CUDA Time!
● Compiled vanilla C source with the nvcc CUDA compiler
● Several issues
  – Incompatible low-level memory hacks
  – CUDA functions using structs are inlined
  – Limited CUDA memory, registers

CUDA Difficulties
● Replaced memory hacks with new memory hacks (maximum memset, extracting bits)
● Other issues not satisfactorily resolvable
  – Inlining → 10 minute compile time
  – Executable size neared 1 MB
  – Limited shared memory → limited precision

Other Difficulties
● Using higher precisions caused the compiler to simply crash
● Found precision = 12 uses the maximum number of CUDA registers
● Nowhere near the capability of the sequential code

Results Cont.
● After the usual 6-8 second CUDA initialization time, code ran far faster than the sequential equivalent (up to the number of parallel processors)
● Asymptotic behavior was as desired, even though the approximation wasn't as good as desired.

[Figure: Summation Terms vs. Runtime. Series: CUDA, Sequential. x-axis: # Summation Terms; y-axis: Time (Seconds).]

[Figure: Accuracy of Approximation. Series: pi approx. x-axis: # Summation Terms; y-axis: Value of Approximation.]

[Figure: CUDA Runtime vs Number of Sum Terms (log scale). x-axis: # Summation Terms; y-axis: Time (Seconds).]

Conclusions
● CUDA is not well-suited to problems which require a moderate amount of memory
● For pure computation, CUDA offers enormous speedups through
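
Sketch: the series in sequential C
The series on the "Overcoming our Obstacles" slide is commonly known as the Bailey-Borwein-Plouffe (BBP) formula. The following is a minimal sketch of its partial sums in sequential C, using ordinary double precision rather than the custom higher-precision format described in the slides. The names `bbp_pi`, `bbp_print_table`, and `PI_REF` are hypothetical and are not taken from the author's code.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

/* Reference value of pi, used only to report the error of a partial sum.
 * (Hypothetical helper constant; not from the original slides.) */
static const double PI_REF = 3.14159265358979323846;

/* Sum the first `terms` terms (i = 0 .. terms-1) of
 *   sum 1/16^i * (4/(8i+1) - 2/(8i+4) - 1/(8i+5) - 1/(8i+6))
 * in plain double precision. */
double bbp_pi(int terms)
{
    assert(terms >= 0);

    double sum = 0.0;
    double scale = 1.0; /* holds 1/16^i; dividing by 16 is exact in binary */

    for (int i = 0; i < terms; i++) {
        double k = 8.0 * i;
        sum += scale * (4.0 / (k + 1.0) - 2.0 / (k + 4.0)
                        - 1.0 / (k + 5.0) - 1.0 / (k + 6.0));
        scale /= 16.0;
    }
    return sum;
}

/* Print each partial sum and its absolute error against PI_REF,
 * so the rate of convergence can be inspected directly. */
void bbp_print_table(int max_terms)
{
    for (int n = 1; n <= max_terms; n++) {
        double approx = bbp_pi(n);
        printf("%2d terms: %.17g (error %.3e)\n",
               n, approx, fabs(approx - PI_REF));
    }
}
```

Calling `bbp_print_table` from a `main` function prints one line per partial sum. Each term is scaled by a further factor of 1/16, so the error shrinks geometrically, and a double-precision sum stops improving after a small number of terms. That is the limit a higher-precision number format is meant to lift.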