UW-Madison ECE 734 - Exploring realizations of large integer multipliers using embedded blocks in modern FPGAs - D442496

School name University of Wisconsin, Madison

Course ECE 734 - VLSI Array Structures for Digital Signal Processing

ECE 734 Project Proposal

Exploring realizations of large integer multipliers using embedded blocks in modern FPGAs

Shreesha Srinath

Motivation

Multiplication is the kernel of many real-life applications. It is used extensively in digital signal processing, image processing, cryptography and multimedia [1,2,3]. Recent computing-oriented FPGAs feature embedded DSP blocks that include small embedded multipliers. An efficient realization of multiplication can have a significant impact on an application's speed, power dissipation and area.

FPGA vendors now offer hardwired multipliers as one of the resources available to designers. Examples include the Xilinx Spartan-3 family, which includes up to 104 on-chip 18x18 multipliers, and the Xilinx Virtex-5 and Virtex-6 families, which include 25x18 multipliers. Optimized realizations of large integer multipliers using such blocks are studied in [4,5,6]. This project aims to study different approaches to implementing large integer multipliers on DSP blocks efficiently in terms of both timing and area.

Related Work

In [4,5], the authors present an efficient design methodology and a systematic approach for implementing multiplication and squaring functions. They propose a general architecture and derive a set of equations to aid in its realization. The method used is the divide-and-conquer algorithm [7] with an efficient organization of the partial products. The symmetric embedded block considered is of size n x n, and the operands are of size k such that (n x (m-1)) < k ≤ (n x m); that is, each operand is split into m chunks of at most n bits.
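The divide-and-conquer idea with symmetric n x n blocks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operands are unsigned, the partial products are summed in plain schoolbook order, and the partial-product organization of [4,5] is not reproduced. The names split_chunks and block_multiply are hypothetical.

```python
def split_chunks(value, chunk_bits, count):
    """Split a non-negative integer into `count` chunks of `chunk_bits` bits,
    least significant chunk first."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (i * chunk_bits)) & mask for i in range(count)]


def block_multiply(x, y, k, n):
    """Multiply two k-bit unsigned operands using only n x n products.

    Returns the product and the number of n x n block multiplications used.
    """
    m = -(-k // n)  # ceil(k / n), so that n*(m-1) < k <= n*m
    xs = split_chunks(x, n, m)
    ys = split_chunks(y, n, m)
    result = 0
    block_products = 0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            # xi * yj stands in for one n x n embedded multiplier;
            # the shift models the weight 2^(n*(i+j)) of that partial product.
            result += (xi * yj) << (n * (i + j))
            block_products += 1
    return result, block_products


# Example: 64-bit operands on 18 x 18 blocks give m = 4 chunks per operand.
x = 0xFEDCBA9876543210
y = 0x0123456789ABCDEF
product, blocks = block_multiply(x, y, k=64, n=18)
```

With k = 64 and n = 18 the sketch uses m*m = 16 block multiplications, which is the count that the Karatsuba-Ofman approach of [6] aims to reduce.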
Table 1 gives the sizes, in bits, of the partial products in the multiplication expression of two operands X and Y. The authors then look at timing- and area-efficient organization of the additions of partial products, including the method of deferred parallel carry addition, in which the carry bits generated at the various levels of partial-product addition are combined and processed later.

In [6], the authors use the Karatsuba-Ofman algorithm [8] and present a detailed methodology for splitting the multiplication so that large multiplications can be implemented with fewer DSP blocks, again targeting n x n basic embedded multipliers. The authors mention that they found no reference to the Karatsuba-Ofman algorithm for integer multiplication in the FPGA literature.

Project Highlight

The prior studies deal with the implementation of large multipliers using symmetric n x n embedded blocks. The goal of this project is to study the methods of [4,5,6] and develop a set of equations to assist the design and implementation of a large multiplier using the asymmetric m x n embedded multipliers now available in the Xilinx Virtex-5 and Virtex-6 families. The approach differs from prior work in that it deals with asymmetric hardwired multipliers. An example of the approach is given below, and an example implementation based on it is to be realized on an FPGA device.

The DSP48E blocks in the Xilinx Virtex-5 and Virtex-6 contain 25 x 18 multipliers. Consider the multiplication of two numbers X and Y, both of size k, which are split into the following chunks, where each chunk of X is n bits wide, each chunk of Y is m bits wide, and n > m:

X = [x2 x1 x0] and Y = [y3 y2 y1 y0]
X = 2^(2n).x2 + 2^(n).x1 + x0
Y = 2^(3m).y3 + 2^(2m).y2 + 2^(m).y1 + y0

Consider the multiplication Z = X*Y.
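The chunking just defined can be checked numerically. The following Python sketch assumes unsigned chunks with n = 25 and m = 18, three chunks for X and four for Y; the helper name asymmetric_partial_products is hypothetical. It lists every partial product together with its bit offset and rebuilds Z from them.

```python
def asymmetric_partial_products(x, y, n=25, m=18, x_chunks=3, y_chunks=4):
    """Return (shift, product) pairs for X split into n-bit chunks
    and Y split into m-bit chunks (unsigned operands assumed)."""
    xs = [(x >> (i * n)) & ((1 << n) - 1) for i in range(x_chunks)]
    ys = [(y >> (j * m)) & ((1 << m) - 1) for j in range(y_chunks)]
    terms = []
    for j, yj in enumerate(ys):
        for i, xi in enumerate(xs):
            # Chunk x_i has weight 2^(i*n) and chunk y_j has weight 2^(j*m),
            # so their product has weight 2^(i*n + j*m).
            terms.append((i * n + j * m, xi * yj))
    return terms


x = (1 << 75) - 12345  # fits in three 25-bit chunks
y = (1 << 72) - 6789   # fits in four 18-bit chunks
terms = asymmetric_partial_products(x, y)

# Summing every product at its bit offset reconstructs Z = X*Y.
z = sum(product << shift for shift, product in terms)
```

Each of the 12 products is at most n + m = 43 bits wide, which is the property that decides which terms can share a result word without overlapping.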
Substituting X and Y, we get

Z = (2^(2n).x2 + 2^(n).x1 + x0) * (2^(3m).y3 + 2^(2m).y2 + 2^(m).y1 + y0)

Z = 2^(2n).(x2.y0) + 2^(n).(x1.y0) + (x0.y0)
  + 2^(2n+m).(x2.y1) + 2^(n+m).(x1.y1) + 2^(m).(x0.y1)
  + 2^(2n+2m).(x2.y2) + 2^(n+2m).(x1.y2) + 2^(2m).(x0.y2)
  + 2^(2n+3m).(x2.y3) + 2^(n+3m).(x1.y3) + 2^(3m).(x0.y3)

Replacing the terms, in the order written, with a(1...3), b(1...3), c(1...3) and d(1...3):

Z = a1 + a2 + a3 + b1 + b2 + b3 + c1 + c2 + c3 + d1 + d2 + d3

Each product is n + m bits wide, so terms whose weights differ by at least n + m bits do not overlap. Observing the sizes of the products, we can reduce the number of additions by grouping the terms (c1, b2, a3) = A, (d1, c2, b3) = B, (b1, a2) = C and (d2, c3) = D, while the terms a1 = E and d3 = F remain. The grouped terms need no adders, because each group is effectively a simple concatenation rather than an addition. The result of the multiplication can now be obtained by adding A, B, C, D, E and F, with the additions organized so as to save area and optimize for speed. The method of deferred parallel carry addition of partial products can be applied to achieve this.

Approach

The approach to be followed is as below:

1.) The prior work considers the case in which hardwired multipliers are available in abundance, so that the multiplication is a highly parallel operation. A paper-and-pencil analysis of FPGA peak floating-point performance [9] clearly shows that DSP blocks are a relatively scarce resource when one wants to use them for accelerating multiplications. Hence the approach will take the limited number of multipliers into account, and a multi-cycle architecture will be proposed.

2.) A set of equations will be developed to help a designer use the asymmetric embedded multipliers to implement large multiplications on FPGAs using the DSP blocks; the Xilinx Virtex-5 and Virtex-6 will be targeted.

3.) An example implementation will be demonstrated and verified on a specific FPGA device belonging to the Xilinx Virtex-5 family, and its resource usage, speed of operation, area and power consumption will be analyzed.

Expected Results

The set of equations to be developed would help a designer use the embedded DSP blocks efficiently to implement large multiplications. The systematic design approach is intended to result in a timing- and area-efficient implementation.

References

[1] Walters, E. III, Arnold, M.G., and Schulte, M.J.: “Using truncated multipliers in DCT and IDCT hardware accelerators”, Proc. SPIE Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, San Diego, California, August 2003, pp. 573–584.
[2] Sheu, M.-H., and Lin, S.-H.: “Fast compensative design approach of the approximate squaring function”, IEEE J. Solid State Circuits, 2002, 37, (1), pp. 95–97.
[3] Stallings, W.: “Cryptography and network
