CprE / ComS 583 – Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #25 – High-Level Compilation
November 28, 2006

Quick Points
[Calendar: November / December 2006 — Lect-25 (Nov 28), Lect-26??, Project Seminars (EDE), Project Seminars (Others), Dead Week, Project Write-ups Deadline, Finals Week, Electronic Grades Due]

Project Deliverables
• Final presentation [15-25 min]
  • Aim for 80-100% project completeness
  • Outline it as an extension of your report:
    • Motivation and related work
    • Analysis and approach taken
    • Experimental results and summary of findings
    • Conclusions / next steps
  • Consider details that will be interesting / relevant for the expected audience
• Final report [8-12 pages]
  • More thorough analysis of related work
  • Minimal focus on project goals and organization
  • Implementation details and results
  • See proceedings of FCCM/FPGA/FPL for inspiration

Recap – Reconfigurable Coprocessing
• Processors are efficient at sequential code and regular arithmetic operations
• FPGAs are efficient at fine-grained parallelism and unusual bit-level operations
• Tight coupling is important: it allows sharing of data/control
• Efficiency is an issue:
  • Context switches
  • Memory coherency
  • Synchronization

Instruction Augmentation
[Figure: swapping bit positions of a31…a0 into b31…b0]
• A processor can only describe a small number of basic computations in a cycle: I bits → 2^I operations
• Many operations could be performed on two W-bit words
• ALU implementations restrict execution of some simple operations, e.g. bit reversal

Recap – PRISC [RazSmi94A]
• Architecture:
  • couples into the register file as a "superscalar" functional unit
  • flow-through array (no state)

Recap – Chimaera Architecture
• A live copy of the register file values feeds into the array
• Each row of the array may compute from registers or intermediates
• A tag on the array indicates an RFUOP

PipeRench Architecture
• Many applications are primarily linear:
  • Audio processing
  • Modified video processing
  • Filtering
• Consider a "striped" architecture which can be very heavily pipelined
  • Each stripe contains LUTs and flip-flops
  • The datapath is bit-sliced
  • Similar to Garp/Chimaera, but standalone
• The compiler initially converts a dataflow application into a series of stripes
• Run-time dynamic reconfiguration of stripes if the application is too big to fit in the available hardware

PipeRench Internals
• Only multi-bit functional units are used
• Very limited resources for interconnect to neighboring processing elements
• Place-and-route is greatly simplified

PipeRench Place-and-Route
[Figure: dataflow graph (F1–F6, D1–D4) mapped onto stripes]
• Since there are no loops and the data flow is linear, the first step is to perform a topological sort
• Attempt to minimize critical paths by limiting NO-OP steps
• If too many stripes are needed, pipeline temporally as well as spatially

PipeRench Prototypes
[Die photo: custom PipeRench fabric (stripes of PEs), with standard cells for the virtualization & interface logic, configuration cache, and data store memory]
• 3.6M transistors
• Implemented in a commercial 0.18μ, 6-metal-layer technology
• 125 MHz core speed (limited by control logic)
• 66 MHz I/O speed
• 1.5V core, 3.3V I/O

Parallel Computation
• What would it take to let the processor and FPGA run in parallel?
• Modern processors deal with:
  • Variable data delays
  • Data dependencies
  • Multiple heterogeneous functional units
• Via:
  • Register scoreboarding
  • Runtime data flow (Tomasulo)

OneChip
• Want the array to have direct memory→memory operations
• Want to fit into the programming model/ISA
  • Without forcing exclusive processor/FPGA operation
  • Allowing decoupled processor/array execution
• Key idea:
  • The FPGA operates on memory→memory regions
  • Make regions explicit to processor issue
  • Scoreboard memory blocks

OneChip Pipeline
[Figure: OneChip pipeline diagram]

OneChip Instructions
• The basic operation is: FPGA MEM[Rsource] → MEM[Rdst]
  • Block sizes are powers of 2
• Supports 14 "loaded" functions
  • DPGA/contexts, so 4 can be cached
• Fits well into a soft-core processor model

OneChip (cont.)
• The basic op is: FPGA MEM → MEM
• No state is kept between these ops
• Coherence: the ops appear sequential
• Could have multiple/parallel FPGA compute units
  • Scoreboard with the processor and with each other
• Single-source operations?
• Can't chain FPGA operations?
OneChip Extensions
[Figure: address map (0x0, 0x1000, 0x10000) with data pages assigned to the Proc and FPGA — indicates usage of data pages like a virtual memory system]
• The FPGA operates on certain memory regions only
• Makes regions explicit to processor issue
• Scoreboard memory blocks

Shadow Registers
• Reconfigurable functional units require tight integration with the register file
• Many reconfigurable operations require more than two operands at a time

Multi-Operand Operations
• What's the best speedup that could be achieved?
  • Provides an upper bound
  • Assumes all operands are available when needed

Additional Register File Access
• Dedicated link – move data as needed
  • Incurs latency
• Extra register port – consumes resources
  • May not be used often
• Replicate the whole (or most of the) register file
  • Can be wasteful

Shadow Register Approach
• Small number of registers needed (3 or 4)
• Use extra bits