
Computation of Pi using CUDA
Dan Padgett
University at Buffalo
12/17/2009

Background
● Want to find a way of utilizing CUDA to improve times for computing digits of pi
● First attempt used numerical integration
  – Proved to be unhelpful
  – Rate of convergence

Obstacles
● Original series converged too slowly
● CUDA compute capability 1.3 supports nothing beyond double precision

Overcoming our Obstacles
● Found new series with fast convergence

$$\pi = \sum_{i=0}^{\infty} \frac{1}{16^{i}} \left( \frac{4}{8i+1} - \frac{2}{8i+4} - \frac{1}{8i+5} - \frac{1}{8i+6} \right)$$

Next Steps of Action
● Implemented new series
  – Sum converged to full precision in 8 iterations
  – Looked for higher precision library
    ● Why has no one written one for CUDA?
    ● We will soon find out...

Implementing Higher Precision
● Started with sequential C
● Modeled after IEEE 754 floating point specs
● Left precision as a #define variable
● Was able to compute precisions up to 2600 integers per number on a worker node

Stop... CUDA Time!
● Compiled vanilla C source with the nvcc CUDA compiler
● Several issues
  – Incompatible low-level memory hacks
  – CUDA functions using structs are inlined
  – Limited CUDA memory, registers

CUDA Difficulties
● Replaced memory hacks with new memory hacks (maximum memset, extracting bits)
● Other issues not satisfactorily resolvable
  – Inlining → 10 minute compile time
  – Executable size neared 1 MB
  – Limited shared memory → limited precision

Other Difficulties
● Using higher precisions caused the compiler to simply crash
● Found that precision = 12 uses the maximum number of CUDA registers
● Nowhere near the capability of the sequential code

Results Cont.
● After the usual 6-8 second CUDA initialization time, the code ran far faster than the sequential equivalent (up to the number of parallel processors)
● Asymptotic behavior was as desired, even though the approximation wasn't as good as desired.

[Figure: Summation Terms vs. Runtime. Series: CUDA, Sequential. x-axis: # Summation Terms; y-axis: Time (Seconds).]

[Figure: Accuracy of Approximation. Series: pi approx. x-axis: # Summation Terms; y-axis: Value of Approximation.]

[Figure: CUDA Runtime vs Number of Sum Terms (log scale). x-axis: # Summation Terms; y-axis: Time (Seconds).]

Conclusions
● CUDA is not well-suited to problems which require a moderate amount of memory
● For pure computation, CUDA offers enormous speedups through
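
Sketch: summing the fast-converging series in sequential C

The following is a minimal sketch of the series from the "Overcoming our Obstacles" slide, written in plain double-precision C. It is not the deck's code: the function names `bbp_pi` and `bbp_abs_error` are hypothetical, and the sketch does not attempt the higher-precision arithmetic or the CUDA port described in the slides. It only illustrates why a handful of terms is enough at double precision, since each term is smaller than the last by roughly a factor of 16.

```c
#include <math.h>

/* Hypothetical helper, not taken from the deck.
 * Sums the first n_terms terms of
 *   pi = sum_{i>=0} 16^(-i) * (4/(8i+1) - 2/(8i+4) - 1/(8i+5) - 1/(8i+6))
 * using ordinary double arithmetic. */
double bbp_pi(int n_terms)
{
    double sum = 0.0;
    double scale = 1.0;              /* holds 16^(-i) for the current i */

    for (int i = 0; i < n_terms; ++i) {
        double k = 8.0 * i;
        sum += scale * (4.0 / (k + 1.0) - 2.0 / (k + 4.0)
                        - 1.0 / (k + 5.0) - 1.0 / (k + 6.0));
        scale /= 16.0;               /* exact in binary floating point */
    }
    return sum;
}

/* Hypothetical helper: absolute error of the partial sum against a
 * double-precision reference value of pi. */
double bbp_abs_error(int n_terms)
{
    const double pi_ref = 3.141592653589793;
    return fabs(bbp_pi(n_terms) - pi_ref);
}
```

Calling `bbp_abs_error` with increasing term counts shows the error dropping quickly until it reaches the limit of double precision, which is the point at which a higher-precision number type becomes necessary.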



UB CSE 710 - Computation of Pi using CUDA
