UW-Madison ECE 734 - Exploring realizations of large integer multipliers using embedded blocks in modern FPGAs - D328871

Home> Schools> University of Wisconsin, Madison> Electrical and Computer Engr (ECE) > ECE 734> Exploring realizations of large integer multipliers using embedded blocks in modern FPGAs

DOC PREVIEW

UW-Madison ECE 734 - Exploring realizations of large integer multipliers using embedded blocks in modern FPGAs

School name University of Wisconsin, Madison

Course Ece 734- VLSI Array Structures for Digital Signal Processing

Pages 21

This preview shows page 1-2-20-21 out of 21 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Exploring realizations of large integer multipliers using embedded blocks in modern FPGAsShreesha SrinathOutline• Motivation •Problem proposal•Related work• Design description •High level design description• Example and results •Performance comparison• Conclusions2Motivation: Problem Proposal3The Problem:• HPCWire Article “A paper-and-pencil analysis of FPGA peak floating-point performance”• Modern FPGAs provide the designer with hardwired mutliplier blocks. • How to build large integer multiplier using these embedded blocks?Solution:• Provide a systematic approach for implementation of multiplication.• Design equations which can be converted into constraint graphs which the designer can analyze given the resources.Motivation: Related Work4• Optimised realisations of large integer multipliers and squarers using embedded blocks – Gao et. al - Systematic approach for “symmetric blocks” • Large multipliers with less DSP blocks – LIP research report. - “Karatsuba-Ofman” Algorithm - Non-Standard Multiplier Tiling - Limited to double-precision floating point multiplication• The Modern FPGAs provide “asymmetric multiplier” blocks• Solution addresses these resources in a general fashionDesign Description: High Level5• Assume two numbers X & Y which are k-bit wide• Assume a basic asymmetric block of size m x n bits with m<n;• We need to compute the multiplication Z = X*Y• Step 1: Split the numbers into m and n sized chunks• Split X into m-sized chunks and Y into n-sized chunks• Number of chunks for X = ceiling(k/n) = p1• Number of chunks for Y = ceiling(k/m) = p2• X = [ Xp1-1 Xp1-2 ...... X0 ]2^n • Y = [ Yp2-1 Yp2-2 ...... Y0 ]2^m• X = 2^(n*(p1-1))* Xp1-1 + 2^(n*(p1-2))* Xp1-2 + ...... + X0• Y = 2^(n*(p2-1))* Yp2-1 + 2^(n*(p2-2))* Yp2-2 + ...... + Y0Step 1 : Obtaining the chunks6Step 2 : Multiplication• Computing the multiplication :Z = X*Y• Substituting the chunks we getZ = ([ Xp1-1 Xp1-2 ...... X0 ]2^n ) * ([ Yp2-1 Yp2-2 ...... Y0 ]2^m)Z = (2^(n*(p1-1))* Xp1-1 + 2^(n*(p1-2))* Xp1-2 + ...... + X0) * (2^(m*(p2-1))* Yp2-1 + 2^(m*(p2-2))* Yp2-2 + ...... + Y0)• Each product is of length m+n-1 bits and in general : PPij = Xp1-i * Yp2-j = (2^(n*(p1-i)+(m*(p2-j))* Xp1-i* Yp2-j 7Grouping the Partial Products8• The partial products obtained by the above computation are then grouped and added to obtain the result of the computation.• How do we group them? And does this matter?• Grouping: - Horizontally - Vertically - DiagonallyExample9• Consider numbers X & Y such that, X = [x2 x1 x0] Y = [y3 y2 y1 y0]• Z = X*Y = [x2 x1 x0]*[y3 y2 y1 y0] = (2^2n.x2 + 2^n.x1 + x0)*( 2^3m.y3 + 2^2m.y2 + 2^m.y1 + y0) = 2^2n.(x2.y0) + 2^2n.(x1.y0) + (x0.y0) + 2^(2n+m).(x2.y1) + 2^(n+m).(x1.y1) + 2^(m).(x0.y0) + 2^(2n+2m).(x2.y2) + 2^(n+2m).(x1.y2) + 2^(2m).(x0.y2) + 2^(2n+3m).(x2.y3) + 2^(n+3m).(x1.y3) + 2^(3m).(x0.y3) = a1 + a2 + a3 + b1+ b2 + b3 + c1+ c2 + c3 + d1+ d2 + d3Grouping Horizontally• Partial Products Obtained are: PP1: (a1 + a2 + a3) PP2: (b1 + b2 + b3) PP3: (c1 + c2 + c3) PP4: (d1 + d2 + d3)Z = PP1 + PP2 + PP3 + PP4 10Translate into Constraint Graph11• Organization for timing• Assuming the cycles for multiplication and addition to be 1• Cost of the implementation is 12 multiplications 12 additions• Result obtained after 7 cyclesGrouping Vertically12Partial Products Obtained are: PP1: (a1 + b1 + c1 +d1) PP2: (a2 + b2 + c2 +d2) PP3: (a3 + b3 + c3 +d3)Z = PP1 + PP2 + PP3Translate into Constraint Graph13• Organization for timing• Assuming the cycles for multiplication and addition to be 1• Cost of the implementation is 12 multiplications 12 additions but larger adders!• Result obtained after 7 cyclesGrouping Diagonally14Partial Products Obtained are: PP1: (a1) PP2: (b1 + a2) PP3: (c1 + b2 + a3) PP4: (d1 + c2 + b3) PP5: (d2 + c3) PP6: (d3)Z = PP1 + PP2 + PP3 + PP4 + PP5 + PP6Why Group Diagonally?15Grouping the products diagonally is not an addition it is merely a concatenation. PP3: (c1 + b2 + a3)= x2y2 + x1y1 + x0y0What other optimization can be done for better timing and area?Why Group Diagonally?16 Deferred Parallel Carry addition of partial product• To optimize the multiplier with an emphasis on area saving. The idea here is a set of carry bits generated from various levels of additions are combined and processed later.• Keep note of the position where carry needs to be generated and insert zeros in between•Caveat : If the position of carry generated by any two levels overlap then if the number is two handled by simple hardware and if more the optimization is not usefulWhy Group Diagonally?17Another advantage of going diagonally.• The adder size at each level increases logarithmically. Stage 1: Largest adder – m+nStage 2: Largest adder – 2(m+n)Stage 3: Largest adder – 3(m+n)•••••Stage Q: Largest adder – Q(m+n)18Translate into Constraint Graph• Best Organization for timing and area• Assuming the cycles for multiplication and addition to be 1• Cost of the implementation is 12 multiplications 6 additions but lesser in area!• Result obtained after 5 cyclesImplementation results19 Three design implemented targeted for Xilinx Virtex-5 family of devicesThe multiplier block available of size : 24 x 17Multiplier capable of handling multiplications of size : 52-68 bits• Baseline design with no deferred carry and horizontal grouping• Horizontal grouping with deferred carry additions• Diagonal grouping with deferred carry additionsImplementation results20Conclusions• Contributions- Systematic methodology presented to build larger multipliers using asymmetric building blocks- Design equations when transformed to constraint graphs help the designer decide the best schedule- The design can further yield better frequencies when pipelined- The proposed approach provides the best design for timing and area

View Full Document