
FPGAs vs. CPUs: Trends in Peak Floating-Point Performance

Keith Underwood
Sandia National Laboratories*
PO Box 5800, MS-1110
Albuquerque, NM
[email protected]

ABSTRACT

Moore's Law states that the number of transistors on a device doubles every two years; however, it is often (mis)quoted based on its impact on CPU performance. This important corollary of Moore's Law states that improved clock frequency plus improved architecture yields a doubling of CPU performance every 18 months. This paper examines the impact of Moore's Law on the peak floating-point performance of FPGAs. Performance trends for individual operations are analyzed, as well as the performance trend of a common instruction mix (multiply accumulate). The important result is that peak FPGA floating-point performance is growing significantly faster than peak floating-point performance for a CPU.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Performance Attributes; C.1.3 [Other Architecture Styles]: Adaptable Architectures

General Terms: Performance

Keywords: FPGA, floating point, supercomputing, trends

* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Copyright 2004 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. FPGA'04, February 22-24, 2004, Monterey, California, USA. Copyright 2004 ACM 1-58113-829-6/04/0002 ...$5.00.

1. INTRODUCTION

The consistency of Moore's law has had a dramatic impact on the semiconductor industry. Advances are expected to continue at the current pace for at least another decade, yielding feature sizes of 45 nm by 2009 [1]. Every two years the feature size for CMOS technology drops by over 40%. This translates into a doubling of transistors per unit area and a doubling of clock frequency every two years. The microprocessor industry has translated this into a doubling of CPU performance every 18 months. Over the last six years, clock rate has yielded a 12× improvement in CPU performance, while architectural changes have yielded only a 1× to 4× improvement (depending on the operation considered). The additional transistors are typically used to compensate for the "memory wall" [2]. Design constraints and legacy instruction sets prevent CPU architects from moving the necessary state closer to the computation; thus, additional functional units would go underutilized.

Unlike CPUs, FPGAs have a high degree of hardware configurability. Thus, while CPU designers must select a resource allocation and a memory hierarchy that performs well across a range of applications, FPGA designers can leave many of those choices to the application designer. Simultaneously, the dataflow nature of computation implemented in field programmable gate arrays (FPGAs) overcomes some of the issues with the memory wall. There is no instruction fetch, and much more local state can be maintained (i.e., there is a larger "register set"). Thus, data retrieved from memory is much more likely to stay in the FPGA until the application is "done" with it. As such, applications implemented in FPGAs are free to utilize the improvements in area that accompany Moore's law.

Beginning with the Xilinx XC4085XL, it became possible to implement IEEE compliant, double-precision, floating-point addition and multiplication; however, in that era, FPGAs were much slower than commodity CPUs. Since then, FPGA floating-point performance has been increasing dramatically. Indeed, the floating-point performance of FPGAs has been increasing more rapidly than that of commodity CPUs. Using the Moore's law factors of 2× the area and 2× the clock rate every two years, one would expect a 4× increase in FPGA floating-point performance every two years. This is significantly faster than the 4× increase in CPU performance every three years. But area and clock rate are not the entire story. Architectural changes to FPGAs have the potential to accelerate (or decelerate) the improvement in FPGA floating-point performance. For example, the introduction of 18×18 multipliers into the Virtex-II architecture dramatically reduced the area needed to build a floating-point multiplier. Conversely, the introduction of extremely high-speed I/O and embedded processors consumed area that could have been used to implement additional programmable logic.
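To make the compounding concrete, here is a minimal Python sketch of the growth rates quoted above. The 10× initial CPU lead is a hypothetical starting point chosen for illustration, not a figure from this paper.

    import math

    # Growth factors quoted above: FPGA peak floating-point performance
    # grows 4x every 2 years (2x area times 2x clock); CPU performance
    # grows 4x every 3 years (i.e., 2x every 18 months).
    fpga_per_year = 4 ** (1 / 2)   # ~2.00x per year
    cpu_per_year = 4 ** (1 / 3)    # ~1.59x per year

    # Sanity check on the six-year CPU figures in the opening paragraph:
    # 2x every 18 months over 6 years is 16x total; with 12x attributed to
    # clock rate, ~1.3x remains for architecture, within the quoted 1x-4x.
    total_cpu_6yr = 2 ** (6 / 1.5)           # 16x
    arch_residual = total_cpu_6yr / 12.0     # ~1.3x

    # Hypothetical: if CPUs start with a 10x lead in peak FLOPS, how long
    # until FPGAs catch up at these rates?
    initial_cpu_lead = 10.0
    crossover = math.log(initial_cpu_lead) / (
        math.log(fpga_per_year) - math.log(cpu_per_year))
    print(f"FPGA {fpga_per_year:.2f}x/yr vs CPU {cpu_per_year:.2f}x/yr; "
          f"architecture residual {arch_residual:.2f}x; "
          f"crossover in ~{crossover:.1f} years")

At these rates the gap closes by a factor of about 1.26 per year, so even a large initial deficit disappears within roughly a decade.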
These trends, coupled with the potential of FPGAs to sustain a higher percentage of peak performance, prompted this analysis of floating-point performance trends. Modern science and engineering are becoming increasingly dependent on supercomputer simulation to reduce experimentation requirements and to offer insight into microscopic phenomena. Such scientific applications at Sandia National Labs depend on IEEE standard, double precision operations. In fact, many of these applications depend on full IEEE compliance (including denormals) to maintain numerical stability. Thus, this paper presents the design of IEEE compliant single and double precision floating-point addition, multiplication, and division. The performance and area requirements of these operators, along with the multiply accumulate composite operator, are given for five FPGAs over the course of six years. From this data, long term trend lines are plotted and compared against known CPU data for the same time period. Each line is extrapolated to determine a long term "winner".

The remainder of this paper is organized as follows. Section 2 presents related work on floating-point in FPGAs. The implementation of the floating-point operations is then presented in Section 3. Section 4 presents the comparison of FPGA and CPU performance trends. Finally, Section 5 presents conclusions and Section 6 presents future work.

2. BACKGROUND

This work is motivated by an extensive body of previous work in floating-point for FPGAs. A long series of work [3, 4, 5, 6, 7] has investigated the use of custom floating-point formats in FPGAs. There has also been some work in the translation of floating-point to fixed point [8] and the automatic optimization of the bit widths of floating-point formats [9]. In most cases, these formats are shown to be adequate for some applications, to require ...
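To illustrate the trade-off behind such custom formats, the sketch below computes the dynamic range and precision implied by a given exponent/mantissa split. The 20-bit format is a made-up example for comparison, not one proposed in the cited work.

    # Illustrative only: IEEE single precision versus a hypothetical,
    # FPGA-friendly 20-bit format (1 sign + 6 exponent + 13 mantissa bits).
    def format_properties(exp_bits, mant_bits):
        """Largest value and relative precision of an IEEE-like format
        with a hidden leading bit and the all-ones exponent reserved."""
        bias = 2 ** (exp_bits - 1) - 1
        max_exp = (2 ** exp_bits - 2) - bias
        largest = (2 - 2 ** -mant_bits) * 2.0 ** max_exp
        epsilon = 2.0 ** -mant_bits    # spacing of values just above 1.0
        return largest, epsilon

    for name, (e, m) in [("IEEE single (8,23)", (8, 23)),
                         ("custom 20-bit (6,13)", (6, 13))]:
        largest, eps = format_properties(e, m)
        print(f"{name}: max ~{largest:.3g}, epsilon ~{eps:.3g}")

Narrower formats like this need far smaller multipliers and shifters in an FPGA, which is the kind of saving the work cited above investigates.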
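Separately, the introduction noted that Sandia's applications depend on full IEEE compliance, including denormals, for numerical stability. The following few lines (ordinary Python on any IEEE-754 machine, illustrating gradual underflow rather than the FPGA designs themselves) show the property that denormal support preserves:

    import sys

    # With gradual underflow, x != y guarantees x - y != 0, a property some
    # numerically sensitive codes rely on. Flush-to-zero hardware breaks it.
    tiny = sys.float_info.min    # smallest normal double, ~2.2e-308
    x = 1.50 * tiny
    y = 1.25 * tiny
    diff = x - y                 # exactly 0.25 * tiny: in the denormal range

    print(diff != 0.0)           # True under IEEE-754 gradual underflow
    print(diff < tiny)           # True: the difference is a denormal
    # A flush-to-zero implementation would produce 0.0 here, so guarded code
    # like `if x != y: z = 1.0 / (x - y)` could still divide by zero.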

