ISU CPRE 583 - lecture 10

CprE / ComS 583 – Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #10 – Systolic Computing
CprE 583 – Reconfigurable Computing, September 21, 2006

Quick Points
• HW #2 due Tuesday at 12:00pm
• Any questions?
• HW #3 (9/28 – 10/17), project proposal (9/26 – 10/5), midterm exam (10/12) coming soon

Recap – Multi-FPGA Systems
• Crossbar topology:
  • Devices A-D are routing only
  • Gives predictable performance
  • Potential waste of resources for near-neighbor connections
[Figure: crossbar topology with devices A, B, C, D and W, X, Y, Z]

Recap – Logic Emulation
• Emulation takes a sizable amount of resources
• Compilation time can be large due to FPGA compiles

Recap – Virtual Wires
• Overcome pin limitations by multiplexing pins and signals
• Schedule when communication will take place

Outline
• Recap
• Introduction and Motivation
• Common Systolic Structures
• Algorithmic Mapping
• Mapping Examples
  • Finite impulse response
  • Matrix-vector product
  • Banded matrix-vector product
  • Banded matrix multiplication

Systolic Computing
sys·to·le (sǐs’tə-lē) n. – the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole
[Greek systolē, from systellein to contract, from syn- + stellein to send]
– sys·tol·ic (sǐs-tõl’ǐk) adj.
Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory. [Kung, 1982]

Systolic Architectures
• Goal – general methodology for mapping computations into hardware (spatial computing) structures
• Composition:
  • Simple compute cells (e.g. add, sub, max, min)
  • Regular interconnect pattern
  • Pipelined communication between cells
  • I/O at boundaries
[Figure: example compute cells]

Motivation
• Effectively utilize VLSI
• Reduce the “Von Neumann Bottleneck”
• Target compute-intensive applications
• Reduce design cost
  • Simplicity
  • Regularity
• Exploit concurrency
• Local communication
  • Short wires (small delay, less area)
• Scalable

Why Study?
• Original motivation – specialized accelerator for an application
• Model and goals are a close match to reconfigurable computing
• Target algorithms match
• Well-developed theory, techniques, and solutions
• One big difference – Kung’s approach targeted custom silicon (not a reconfigurable fabric)
  • Compute elements needed to be more general

Common Systolic Structures
• One-dimensional linear array
[Figure: linear array of cells F1, F2, F3, F4, …, FN]
• Two-dimensional mesh
[Figure: M × N mesh of cells F1,1 through FM,N]

Hexagonal Array
• Communicates with six nearest neighbors
[Figure: hexagonal array of cells F1 through F16, shown alongside its squared-up representation]

Binary Tree
[Figure: binary tree of cells F1 through F15, shown alongside its H-tree representation]

Mapping Approach
• Allocate PEs
• Schedule computation
• Schedule PEs
• Schedule data flow
• Optimize
• Available Transformations:
  • Preload repeated values
  • Replace feedback loops with registers
  • Internalize data flow
  • Broadcast common input

Example – Finite Impulse Response
• A Finite Impulse Response (FIR) filter is a type of digital filter
• Finite – response to an impulse eventually settles to zero
• Requires no feedback

$$ y_i = w_1 \cdot x_i + w_2 \cdot x_{i+1} + \cdots + w_k \cdot x_{i+k-1} = \sum_{j=1}^{k} w_j \cdot x_{i+j-1} $$

    for (i=1; i<=n; i++)
      for (j=1; j<=k; j++)
        y[i] += w[j] * x[i+j-1];

FIR Attempt #1
• Parallelize the outer loop

    for (i=1; i<=n; i++) in parallel
      for (j=1; j<=k; j++) sequential
        y[i] += w[j] * x[i+j-1];

[Figure: n cells; cell 1 takes w_j and x_j and accumulates y_1, cell 2 takes w_j and x_{j+1} and accumulates y_2, …, cell n takes w_j and x_{n+j-1} and accumulates y_n]

FIR Attempt #1 (cont.)
• Broadcast common inputs
[Figure: a single w_j input shared by all cells; cell i still takes x_{j+i-1} and accumulates y_i]

FIR Attempt #1 (cont.)
• Retime to eliminate broadcast
[Figure: linear array with w_j entering at one end; cells take x_j, x_{j+1}, x_{j+2}, …, x_{j+n-1} and accumulate y_1, y_2, y_3, …, y_n]

FIR Attempt #1 (cont.)
• Broadcast common values
[Figure: linear array with inputs labeled w_j and x_{i-1} and accumulators y_1, y_2, y_3, …, y_n]

FIR Attempt #2
• Parallelize the inner loop

    for (i=1; i<=n; i++) sequential
      for (j=1; j<=k; j++) in parallel
        y[i] += w[j] * x[i+j-1];

[Figure: k cells; cell 1 takes w_1 and x_i, cell 2 takes w_2 and x_{i+1}, …, cell k takes w_k and x_{i+k-1}, each updating y_i]

FIR Attempt #2 (cont.)
• Internalize data flow
[Figure: chain of cells with weights w_1, w_2, w_3, …, w_k and inputs x_i, x_{i+1}, x_{i+2}, …, x_{i+k-1}; y_i enters one end of the chain and leaves the other]

FIR Attempt #2 (cont.)
• Allocation schedule
[Figure: schedule in which y_1, y_2, y_3, …, y_n pass through the chain while cell j receives the stream x_j, x_{j+1}, x_{j+2}, …, x_{i+j-1}, …]

FIR Attempt #2 (cont.)
• Preload repeated values
[Figure: the same schedule with w_1, w_2, w_3, …, w_k preloaded into the cells]

FIR Attempt #2 (cont.)
• Broadcast common values
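
The FIR loop nest and the Attempt #2 idea (weights preloaded into k cells, inner loop in parallel) can be sketched in C as below. This is a minimal software model under stated assumptions, not the lecture's hardware design: indexing is shifted from the slides' 1-based form to 0-based, the names `fir_direct`, `fir_array_model`, and `MAX_TAPS` are hypothetical, and the per-cell loop runs sequentially here where an array would perform the k multiply-adds concurrently.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAPS 16 /* hypothetical upper bound on k for this sketch */

/* Reference: the sequential loop nest from the slides, 0-based.
 * x must hold n + k - 1 samples; y receives n outputs. */
void fir_direct(const int *w, size_t k, const int *x, size_t n, int *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = 0;
        for (size_t j = 0; j < k; j++)
            y[i] += w[j] * x[i + j];
    }
}

/* Software model of a k-cell linear array with preloaded weights.
 * Cell j keeps w[j] fixed and holds one sample in x_reg[j]. */
void fir_array_model(const int *w, size_t k, const int *x, size_t n, int *y)
{
    int x_reg[MAX_TAPS];

    assert(k >= 1 && k <= MAX_TAPS);

    /* Prime cells 1..k-1 with x[0..k-2]; cell 0 is filled by the first shift. */
    for (size_t j = 1; j < k; j++)
        x_reg[j] = x[j - 1];

    for (size_t i = 0; i < n; i++) {
        /* Shift samples one cell toward cell 0 and feed a new one at the end,
         * so that cell j now holds x[i + j]. */
        for (size_t j = 0; j + 1 < k; j++)
            x_reg[j] = x_reg[j + 1];
        x_reg[k - 1] = x[i + k - 1];

        /* The partial sum for y[i] passes through the chain of cells;
         * each cell adds its own product w[j] * x[i + j]. */
        int partial = 0;
        for (size_t j = 0; j < k; j++)
            partial += w[j] * x_reg[j];
        y[i] = partial;
    }
}
```

Calling both functions on the same `w` and `x` should give identical `y` arrays, which is a quick way to check any alternative mapping against the reference loop nest.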

