DOC PREVIEW
ISU CPRE 583 - Lect-05

This preview shows page 1-2 out of 5 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1CprE / ComS 583Reconfigurable ComputingProf. Joseph ZambrenoDepartment of Electrical and Computer EngineeringIowa State UniversityLecture #5 – FPGA ArithmeticCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.2Quick Points• HW #1 due Thursday at 12:00pm• Any comments?CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.3Recap• Cluster size of N = [6-8] is good, K = [4-5]CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.4LUT Mapping TechniquesCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.5LUT Mapping Techniques (cont.)CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.6LUT Mapping Techniques (cont.)2CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.7Outline• Recap• Motivation• Carry / Cascade Logic• Addition• Ripple Carry• Carry Bypass• Carry Select• Carry Lookahead• Basic MultipliersCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.8Motivation• Traditional microprocessors, DSPs, etc. don’t use LUTs• Instead use a w-bit Arithmetic and Logic Unit (ALU)• Carry connections are hard-wired• No switches, no stubs, short wires(1)AND2OR2XOR2(2)ADDSUBCMP3-LUT 3-LUT3-LUT3-LUT(2)ALUABOutOp(1, 2)2-LUTABOut(1)A B CinSumCoutSumCout / CinABCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.9Adder Delays• Assuming a ripple-carry adder:• 32-bit ALU delay – 6ns• 32-cascaded 4-LUTs – 32 x 2.5ns = 80ns• Motivates extra hardware to accelerate carry operations2.0 ns0.5 ns2.5 nsLogic delaySingle channel delayper bit4-LUT delay16 ns6 nsArea optimizedDelay optimizedCompare: 32-bit ALU (0.6λ)CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.10Altera Flex 8000 Carry ChainCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.11Xilinx XC4000 Carry ChainCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.12Cascades• Large fanin operations (reductions):• Decoding• Matching• Completion detection• Many-to-one reductions• Combining logic is simple • Makes use of dedicated paths3CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.13Altera Cascade Logic• LE delay – 2.4 ns• Cascade delay – 0.6 nsCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.14Why Look at Arithmetic?• Parallelization • Specialization• Architecture• Size• Inputs• Adder problem – delay grows linearly with bit width• Solutions for larger adders:• Pipelining• Carry bypass• Carry selectCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.15Adder Pipelining• Not as practical in ASIC world (registers are expensive)• Registers essentially “free” in FPGA logic blocks+A1B1S1+CoutA2B2S2+A3B3S3CoutCoutCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.16Carry Bypass Adders• If all the propagates are 1 (P0P1P2P3= 1) then Cout3= Cin• Pi= Aixor Bi• Skip all the carry logic• Inexpensive+A0B0Cin+A1B1+A2B2+A3B3Cout0Cout1Cout301Cout2P0P1P2P3CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.17Carry Bypass Performance• Small hardware cost:• 16-bit add – 4 CLB overhead• 32-bit add – 9 CLB overhead• Delay growth still linear, smaller slopeDelayRipple Carry Carry Bypass N-bit AdderCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.18Carry Select Adders• Precompute addition value for (Cin = 0) case and (Cin = 1) case• Use mux to select between two with actual Cin value• Cost of this approach?+A0B0+A1B1+A2B2Cin+A4B4+A4B401+A3B3+A5B5+A6B6+A7B7+A5B5+A6B6+A7B7S4-7014CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.19Linear Carry-Select• Adder delay = w, mux delay = 1+A31-24B31-2401+01S31-24+A23-16B23-1601+01S23-16+A15-8B15-801+1S15-8+A7-0B7-001+1S7-0t8t8t8t8t8t8t8t8t9t10t11t1200CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.20Square-Root Carry Select• Each carry arrives when the corresponding sum is ready01++01A31-30B31-30S31-3001++01A29-22B29-22S29-2201++01A21-15B21-15S21-1501++01A14-9B14-9S14-901++01A8-4B8-4S8-401++01A3-0B3-0S3-0t4t4t5t5t5t6t6t7t7t8t8t6t7t8t9t10CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.21Constant Addition• If one operand is constant:• More speed?• Less hardware?HAA00S0FAA1S1FAA2S2FAA3S31 0 1C3A0S0HAA2S2HAA3S3C3A1S1CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.22Multiplication• Shift and add operations• Need N bit adder, M cycles42 = 101010 Multiplicand (N bits)x 10 = x 1010 Multiplier (M bits) 000000 101010 Partial products000000+ 101010420 = 111001110 Product (N+M bits)CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.23Array Multiplier• Area – N x M cells• Delay – O(N+M)+X0Y0X1X2X3Z0Y1X0+X1+X2+X3Z1+Y2X0+X1+X2+X3+Y3X0+X1+X2+X3Z2CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.24Carry-Save+X0Y0X1X2X3Z0Y1X0+X1+X2+X3Z1+Y2++++Y3+++Z25CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.25Multiplier Pipelining• Register cost:• Multiplicand – (N bits/stage x M stages)• Multiplier – (M2+ M) / 2 bits• Early output values – (M2+ M) / 2 bits• Total – M x (N + M + 1) bits• Critical path = max:• DFF + FA + setup• Bottom-level adderCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.26Constant MultiplierY0=0Z0X0X1X2X3Z1+Y3=1X0+X1+X2+X3Z2• Can greatly reduce the number of adders• Removes all and gatesY1=1Y2=0CprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.27LUT-based Constant Multipliers• k-LUT can perform constant multiply of k-bit number• Break operand into k-bit quantities• Example: 8-bit x 8-bit constant, k=410101011x NNNNNNNNAAAAAAAAAAA (N * 1011 (LSN))+ BBBBBBBBBBBBBBB (N * 1010 (MSN))SSSSSSSSSSSSSSS ProductCprE 583 – Reconfigurable ComputingSeptember 5, 2006 Lect-05.28Summary• Latency overhead of programmable logic• Several approaches to reducing design latency:• Fast carry• Cascade• Hardwired connections• Multiplier optimization goals different from adder• Other techniques:• Logarithmic v. linear (Wallace Tree multiplier)• Data encoding (Booth’s


View Full Document

ISU CPRE 583 - Lect-05

Download Lect-05
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lect-05 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lect-05 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?