HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array

William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani, Varghese George, John Wawrzynek, and André DeHon

Berkeley Reconfigurable, Architectures, Software, and Systems
Computer Science Division
University of California at Berkeley
Berkeley, CA 94720-1776
contact: <[email protected]>

Abstract

There is no inherent characteristic forcing Field-Programmable Gate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve application clock rates close to their processor cousins because (1) resources in the FPGAs are not balanced appropriately for high-speed operation, (2) FPGA CAD does not automatically provide the requisite transforms to support this operation, and (3) interconnect delays can be large and vary almost continuously, complicating high-frequency mapping. We introduce a novel reconfigurable computing array, the High-Speed, Hierarchical Synchronous Reconfigurable Array (HSRA), and its supporting tools. This package demonstrates that computing arrays can achieve efficient, high-speed operation. We have designed and implemented a prototype component in a 0.4 µm logic design on a DRAM process which will support 250 MHz operation for CAD-mapped designs.

A common myth about FPGAs is that they are inherently 10× slower than processors. We see no physical limitations which would make this true, but there are some good reasons why this myth persists.

Looking at raw cycle times, we see that the potential operating frequencies for FPGAs are comparable to processors in the same process (see Table 1). The cycle time on a processor represents the minimum interval at which a new operation on new data can be initiated or completed. That is, it defines how fast we can clock the computational and memory units and reuse them to perform subsequent operations.
Since traditional FPGAs are not synchronous, it is not as obvious what the native cycle time is for an FPGA. However, if we also take the FPGA cycle time as the minimum interval at which we can launch a new datum for computation, then we can identify a cycle time. For example, the XC4000XL-09 family has a logic-evaluation-to-clock setup time of 1.6 ns and a clock-to-Q time of 1.5 ns. If we take the minimum clock low and high times of 2.3 ns each, we can define a cycle of 4.6 ns, which leaves (4.6 − 1.5 − 1.6) = 1.5 ns for interconnect on each cycle. Similarly, Von Herzen defined a 4 ns cycle on the XC3100-09 and designed his signal-processing applications to this cycle time [11]. In Table 1, we see that these cycle times are within a factor of two of processors in the same process.

(To appear in the Seventh International Symposium on Field-Programmable Gate Arrays, February 21–23, Monterey, CA.)

In practice, however, the applications we see running at 200 MHz+ on these FPGAs are few and far between. While the basic cycle time for an FPGA is small, most contemporary FPGA designs run much slower, more typically in the 25–70 MHz range. Why do designs run this much slower than the conceivable peak? We conjecture there are several factors which contribute to the low frequency of most FPGA designs:

1. no reason to run faster – Often the limited speed is all the user wants or needs, and there is no application reason to run at a higher cycle rate. For example, if the application is sample-rate limited at a modest sample rate, there is no requirement to process data at a higher rate. Furthermore, when data rates are limited by system components outside of the FPGA or by standards, the application may have no cause to run at a faster rate. However, when such external or application limits appear, it is often possible to reduce the hardware required by running a more serialized design in less space (fewer gates, smaller FPGA component) at the higher cycle rate achievable by the FPGA.
2. cyclic data dependencies limit pipelineability – Cycles in the flow graph define a minimum clock cycle time. We cannot pipeline down to the LUT level within such cycles. We can, however, run the design C-slow [14] at the LUT-cycle rate, allowing us to solve C independent problems simultaneously in the hardware space. If we do not have a number of independent problems to solve, we can reuse gates and interconnect at the LUT-cycle rate to solve the problem in less area when the device has multiple contexts (e.g. DPGA [6]).

3. inadequate tool support – Reorganizing a design to run at this tight cycle rate can be a tedious task. While the basic technology is known in the design-automation world, typical FPGA tools and design flows do not provide support for aggressive retiming. In part this results from the traditional glue-logic replacement philosophy, which lets the user define the base cycle and what has to happen within a cycle, rather than taking a computational view which says that the user defines a task and the tools are free to transform the problem as necessary to map the user's task onto the computing platform.

Design        Feature size  Cycle   Reference
XC4000XL-09   0.35 µm       4.6 ns  [20]
A10K100A-1    0.35 µm       5.0 ns  [1]
StrongARM     0.35 µm       5.0 ns  [15]
Alpha         0.35 µm       2.3 ns  [10]
SPARC         0.35 µm       3.0 ns  [9]
Pentium       0.35 µm       3.3 ns  [4]
Alpha         0.35 µm       1.7 ns  [7]
HSRA          0.40 µm       4.0 ns  (scaled from a 5.0 ns cycle at 0.40 µm by minimum feature size)

Table 1: Cycle rate comparison at 0.35 µm

Number of registers  1   2   3    4    5    6     7    8     9     10
Percentage           72  16  4.5  2.2  1.3  0.96  1.2  0.46  0.12  0.11

Table 2: Benchmark-wide distribution of registers required between LUTs

4. interconnect delays dominate – Interconnect delays depend on the distance between source and sink and can easily dominate all other delays. We were only able to define the tight cycle times we did above by assuming very local communications. If we allowed even one cross-chip delay time in the
cycle, the cycle time would increase significantly. This leads us to believe that either we have to accept a much larger cycle time, or we must limit all communications to local connections, as in [11]. As long as we must traverse an entire long interconnect line in a single cycle, we are left where we can only achieve the tight cycle for very stylized problems or with heroic personal effort to design and lay out the computation entirely using local connections.

5. pipelining becomes expensive – In order to pipeline the device heavily enough to run at this cycle rate, the design needs a larger number of flip-flops for proper retiming. While flip-flops are "relatively" cheap in many FPGAs, the typical balance is roughly one flip-flop per 4-LUT. However, for a fully pipelined
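The cycle-time arithmetic quoted above for the XC4000XL-09 (setup, clock-to-Q, and minimum clock pulse widths) can be sketched as a small budget calculation. This is an illustrative sketch, not a tool; the parameter names are ours, not official datasheet symbols.

```python
# Sketch: deriving an FPGA "cycle time" from datasheet timing parameters,
# using the XC4000XL-09 numbers quoted in the text. Parameter names are
# illustrative, not official Xilinx datasheet symbols.

def cycle_budget(t_setup, t_clk_to_q, t_clk_low, t_clk_high):
    """Return (cycle_time, slack_for_interconnect) in ns."""
    cycle = t_clk_low + t_clk_high              # minimum clock period
    interconnect = cycle - t_clk_to_q - t_setup  # what is left for routing
    return cycle, interconnect

cycle, wire = cycle_budget(t_setup=1.6, t_clk_to_q=1.5,
                           t_clk_low=2.3, t_clk_high=2.3)
print(round(cycle, 2))  # 4.6 ns minimum cycle
print(round(wire, 2))   # 1.5 ns left for interconnect each cycle
```

Note how little of the 4.6 ns cycle remains for interconnect; this is why the text stresses that only very local communication fits within the tight cycle.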
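The C-slow transformation mentioned in item 2 can be illustrated with a behavioral sketch: replacing every register in a feedback loop with C registers lets the same logic interleave C independent problems, one per fast clock. The accumulator example and function name below are our own illustration, not from the paper or from [14].

```python
# Sketch of C-slow retiming: a single accumulator (one adder, one loop
# register) is C-slowed so that C independent running sums time-share the
# same logic at the fast LUT-cycle rate.

def c_slow_accumulator(streams):
    """Interleave len(streams) independent accumulations through one adder.

    streams: list of C equal-length input sequences.
    Returns the C running-sum output sequences.
    """
    C = len(streams)
    regs = [0] * C                   # the single loop register, C-slowed
    outs = [[] for _ in range(C)]
    for cycle in range(len(streams[0]) * C):    # one "LUT cycle" per step
        ctx = cycle % C              # which independent problem owns this cycle
        regs[ctx] += streams[ctx][cycle // C]   # the shared adder logic
        outs[ctx].append(regs[ctx])
    return outs

# Three independent running sums share one adder at the fast clock rate:
sums = c_slow_accumulator([[1, 1, 1], [2, 2, 2], [5, 0, 5]])
print(sums)  # [[1, 2, 3], [2, 4, 6], [5, 5, 10]]
```

Each individual stream still sees the original loop latency; the win is throughput, which is exactly why the text notes that C-slowing helps only when C independent problems are available.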