CprE / ComS 583
Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #10 – Systolic Computing
CprE 583 – Reconfigurable Computing, September 21, 2006

Quick Points
• HW #2 due Tuesday at 12:00pm
• Any questions?
• HW #3 (9/28 – 10/17), project proposal (9/26 – 10/5), midterm exam (10/12) coming soon

Recap – Multi-FPGA Systems
• Crossbar topology:
  • Devices A-D are routing only
  • Gives predictable performance
  • Potential waste of resources for near-neighbor connections
[Figure: crossbar topology with devices A, B, C, D and W, X, Y, Z]

Recap – Logic Emulation
• Emulation takes a sizable amount of resources
• Compilation time can be large due to FPGA compiles

Recap – Virtual Wires
• Overcome pin limitations by multiplexing pins and signals
• Schedule when communication will take place

Outline
• Recap
• Introduction and Motivation
• Common Systolic Structures
• Algorithmic Mapping
• Mapping Examples
  • Finite impulse response
  • Matrix-vector product
  • Banded matrix-vector product
  • Banded matrix multiplication

Systolic Computing
sys·to·le (sǐs’tə-lē) n. – the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole
[Greek systolē, from systellein to contract, from syn- + stellein to send]
– sys·tol·ic (sǐs-tõl’ǐk) adj.
Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory. [Kung, 1982]

Systolic Architectures
• Goal – general methodology for mapping computations into hardware (spatial computing) structures
• Composition:
  • Simple compute cells (e.g. add, sub, max, min)
  • Regular interconnect pattern
  • Pipelined communication between cells
  • I/O at boundaries
[Figure: example compute cells]

Motivation
• Effectively utilize VLSI
• Reduce “Von Neumann Bottleneck”
• Target compute-intensive applications
• Reduce design cost
  • Simplicity
  • Regularity
• Exploit concurrency
• Local communication
  • Short wires (small delay, less area)
• Scalable

Why Study?
• Original motivation – specialized accelerator for an application
• Model and goals are a close match to reconfigurable computing
  • Target algorithms match
  • Well-developed theory, techniques, and solutions
• One big difference – Kung’s approach targeted custom silicon (not a reconfigurable fabric)
  • Compute elements needed to be more general

Common Systolic Structures
• One-dimensional linear array
[Figure: linear array of cells F1, F2, F3, F4, …, FN]
• Two-dimensional mesh
[Figure: M × N mesh of cells F1,1 through FM,N]

Hexagonal Array
• Communicates with six nearest neighbors
[Figure: hexagonal array of cells F1-F16]
• Squared-up representation
[Figure: the same cells F1-F16 drawn on a square grid]

Binary Tree
[Figure: binary tree of cells F1-F15]
• H-Tree Representation
[Figure: H-tree layout of cells F1-F15]

Mapping Approach
• Allocate PEs
• Schedule computation
• Schedule PEs
• Schedule data flow
• Optimize
• Available Transformations:
  • Preload repeated values
  • Replace feedback loops with registers
  • Internalize data flow
  • Broadcast common input

Example – Finite Impulse Response
• A Finite Impulse Response (FIR) filter is a type of digital filter
  • Finite – response to an impulse eventually settles to zero
  • Requires no feedback

    y_i = w_1 x_i + w_2 x_{i+1} + \cdots + w_k x_{i+k-1} = \sum_{j=1}^{k} w_j x_{i+j-1}

    for (i=1; i<=n; i++)
      for (j=1; j<=k; j++)
        y[i] += w[j] * x[i+j-1];

FIR Attempt #1
• Parallelize the outer loop

    for (i=1; i<=n; i++) in parallel
      for (j=1; j<=k; j++) sequential
        y[i] += w[j] * x[i+j-1];

[Figure: n cells; cell i takes wj and xi+j-1 as inputs and accumulates yi]

FIR Attempt #1 (cont.)
• Broadcast common inputs

    for (i=1; i<=n; i++) in parallel
      for (j=1; j<=k; j++) sequential
        y[i] += w[j] * x[i+j-1];

[Figure: cells y1, y2, y3, …, yn with inputs xj, xj+1, xj+2, …, xj+n-1; the per-cell wj inputs are replaced by a single wj shared by all cells]

FIR Attempt #1 (cont.)
• Retime to eliminate broadcast

    for (i=1; i<=n; i++) in parallel
      for (j=1; j<=k; j++) sequential
        y[i] += w[j] * x[i+j-1];

[Figure: cells y1, y2, y3, …, yn with inputs xj, xj+1, xj+2, …, xj+n-1 and wj]

FIR Attempt #1 (cont.)
• Broadcast common values

    for (i=1; i<=n; i++) in parallel
      for (j=1; j<=k; j++) sequential
        y[i] += w[j] * x[i+j-1];

[Figure: cells y1, y2, y3, …, yn with inputs wj and xi-1]

FIR Attempt #2
• Parallelize the inner loop

    for (i=1; i<=n; i++) sequential
      for (j=1; j<=k; j++) in parallel
        y[i] += w[j] * x[i+j-1];

[Figure: k cells; cell j takes wj and xi+j-1 as inputs and contributes to yi]

FIR Attempt #2 (cont.)
• Internalize data flow

    for (i=1; i<=n; i++) sequential
      for (j=1; j<=k; j++) in parallel
        y[i] += w[j] * x[i+j-1];

[Figure: chain of cells with weights w1, w2, w3, …, wk and inputs xi, xi+1, xi+2, …, xi+k-1; yi passes through the chain]

FIR Attempt #2 (cont.)
• Allocation schedule

    for (i=1; i<=n; i++) sequential
      for (j=1; j<=k; j++) in parallel
        y[i] += w[j] * x[i+j-1];

[Figure: cells w1, w2, w3, …, wk; input streams x1, x2, x3, …, xi / x2, x3, x4, …, xi+1 / x3, x4, x5, …, xi+2 / … / xk, xk+1, xk+2, …, xi+k-1; outputs y1, y2, y3, …, yn]

FIR Attempt #2 (cont.)
• Preload repeated values

    for (i=1; i<=n; i++) sequential
      for (j=1; j<=k; j++) in parallel
        y[i] += w[j] * x[i+j-1];

[Figure: same array and input streams as the allocation schedule, with the weights w1, w2, w3, …, wk preloaded into the cells]

FIR Attempt #2 (cont.)
• Broadcast common values

    for (i=1; i<=n; i++) sequential
      for
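
The FIR loop nest used throughout these slides can be checked in software against a cycle-by-cycle model of a linear array. The sketch below is one plausible arrangement, not necessarily the design the slides arrive at: it assumes the weights are preloaded into K cells, each new x value is broadcast to every cell, and partial sums shift one cell to the right per cycle. The names `fir_reference` and `fir_systolic` and the sizes `K` and `N` are hypothetical, and indices are 0-based rather than the 1-based indices on the slides.

```c
#include <assert.h>

#define K 3   /* number of taps / cells (hypothetical size) */
#define N 4   /* number of outputs (hypothetical size) */

/* Reference: the nested loop from the slides, rewritten with 0-based indices.
   x must hold N + K - 1 samples. */
void fir_reference(const int w[K], const int x[N + K - 1], int y[N]) {
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < K; j++)
            y[i] += w[j] * x[i + j];
    }
}

/* Cycle-by-cycle model of a K-cell linear array (an assumed structure):
   cell j holds the preloaded weight w[j] and a partial-sum register acc[j].
   Each cycle t, x[t] is broadcast to all cells and every cell computes
   acc[j] = (previous-cycle acc[j-1]) + w[j] * x[t]. */
void fir_systolic(const int w[K], const int x[N + K - 1], int y[N]) {
    int acc[K] = {0};

    for (int t = 0; t < N + K - 1; t++) {
        /* Walk from the last cell to the first so that acc[j-1] still
           holds its value from the previous cycle when cell j reads it. */
        for (int j = K - 1; j >= 0; j--) {
            int from_left = (j == 0) ? 0 : acc[j - 1];
            acc[j] = from_left + w[j] * x[t];
        }
        /* After K - 1 fill cycles, the last cell emits one result per cycle. */
        if (t >= K - 1)
            y[t - (K - 1)] = acc[K - 1];
    }
}

/* Runs both versions on a small example and checks that they agree. */
void check_example(void) {
    const int w[K] = {1, 2, 3};
    const int x[N + K - 1] = {1, 2, 3, 4, 5, 6};
    int y_ref[N], y_sys[N];

    fir_reference(w, x, y_ref);
    fir_systolic(w, x, y_sys);
    for (int i = 0; i < N; i++)
        assert(y_ref[i] == y_sys[i]);
}
```

Calling `check_example()` from a `main` function exercises both versions. In this model the array needs K - 1 cycles to fill and then produces one output per cycle, which is the behavior the "parallelize the inner loop" mapping aims for.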