A HARDWARE ACCELERATOR STORE FOR LOW POWER PROCESSORS
May 15, 2008
GROUP 1
Mike Lyons
Kevin Brownell
Durlov Khan

Motivation

Custom hardware design is widely known for its ability to improve performance and increase power efficiency. ASICs often rely on custom hardware in the form of IP blocks or hardware accelerators. These accelerators typically follow a completely black-box model and contain all the memory modules required for operation. Including memory in the accelerator gives the designer full control of the memory, and therefore reliable performance.

The internalized memory approach can result in inefficient use of area, degraded performance during accelerator interaction, redundant logic, and poor memory power consumption. Area may be wasted because each accelerator may not utilize all of its memory simultaneously. Further, communication performance between IP blocks may be degraded if the ASIC must copy large amounts of memory between the private memory blocks in each accelerator. By removing inter-accelerator communication bottlenecks, we can in some cases reduce accelerator granularity, decomposing large accelerators into several smaller and more widely applicable accelerators. If multiple accelerators require support for additional memory features such as power reduction (VDD-gating) or data structures (queues or stacks), this logic must be duplicated across several accelerators. Alternatively, ASIC designers may be unable to add memory power reduction to accelerators that lack native support for it.

The accelerator store is a shared memory framework. It provides a common interface for each accelerator to read and write memory in a variety of ways. This allows accelerator designers to separate logic from state: any non-transient data is stored in the accelerator store. This simplifies VDD-gating design of the accelerator, since gating a logic-only accelerator does not lose any state. The accelerator store consists of several memory modules. If a memory module contains valid data, the module remains on; if it does not, it can be VDD-gated to reduce leakage power.

This design also provides accelerators with a richer means of communication. Rather than copying data from one accelerator's private memory to another's, accelerator-to-accelerator communication can follow a producer/consumer model. Data-producing accelerators use FIFOs/queues to insert data into the accelerator store, and data-consuming accelerators then read this data in order. With this style of operation, accelerators do not need to be designed specifically with each other in mind. At the same time, no arbitrating logic is required beyond initializing the accelerator store and pointing the producers and consumers to the correct FIFOs/queues.
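To make the producer/consumer pattern concrete, the following C sketch models a FIFO handle in software. The names (as_handle_t, as_fifo_push, as_fifo_pop), the circular-buffer layout, and the 16-word FIFO size are assumptions made only for this illustration; in the actual design, accelerators issue the equivalent read and write requests in hardware over accelerator store ports, and the head/tail bookkeeping lives in the handle table described later in this report.

```c
/*
 * Minimal software model of producer/consumer communication through the
 * accelerator store.  All names and the FIFO layout are assumptions for
 * illustration; real accelerators would issue equivalent requests in
 * hardware over their accelerator store ports.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define AS_MEM_WORDS 1024u              /* shared accelerator store memory */
static uint32_t as_mem[AS_MEM_WORDS];

/* One handle: a region of store memory operated as a circular FIFO. */
typedef struct {
    uint32_t base;                      /* first word of the region        */
    uint32_t size;                      /* region length in words          */
    uint32_t head;                      /* next word to read (consumer)    */
    uint32_t tail;                      /* next word to write (producer)   */
    uint32_t count;                     /* words currently queued          */
} as_handle_t;

/* Producer side: append one word, failing if the FIFO is full. */
static bool as_fifo_push(as_handle_t *h, uint32_t word)
{
    if (h->count == h->size)
        return false;                   /* full: producer must stall       */
    as_mem[h->base + h->tail] = word;
    h->tail = (h->tail + 1) % h->size;
    h->count++;
    return true;
}

/* Consumer side: remove the oldest word, failing if the FIFO is empty. */
static bool as_fifo_pop(as_handle_t *h, uint32_t *word)
{
    if (h->count == 0)
        return false;                   /* empty: consumer must wait       */
    *word = as_mem[h->base + h->head];
    h->head = (h->head + 1) % h->size;
    h->count--;
    return true;
}

int main(void)
{
    /* The processor would create this handle through the management
       interface; here it is simply initialized directly in software.   */
    as_handle_t fifo = { .base = 0, .size = 16, .head = 0, .tail = 0, .count = 0 };

    for (uint32_t i = 0; i < 4; i++)    /* data-producing accelerator      */
        as_fifo_push(&fifo, i * i);

    uint32_t w;
    while (as_fifo_pop(&fifo, &w))      /* data-consuming accelerator      */
        printf("consumed %u\n", (unsigned)w);
    return 0;
}
```

Note how neither side names the other: both operate only on the shared handle, which is why producers and consumers do not need to be designed with each other in mind.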
One large concern with removing memory from within accelerators and creating a shared memory system is a reduction in performance. Whereas an accelerator with embedded memory can design its memory accesses specifically to satisfy the bandwidth its internal operations require, memory accesses in a shared memory system can be limited by the shared memory's bandwidth. To address this point, we previously developed a cycle-accurate simulator and demonstrated that the accelerator store does not introduce a bottleneck for these internal operations. We can also use this simulator to show performance improvements due to improved inter-accelerator communication.

High Level System Description

Our system reflects the figure shown on the right. A processor (not pictured) communicates over a system bus (address/data) with several accelerators. Each accelerator is memory mapped, so the processor can read from and write to accelerators using standard memory instructions. The system is designed for maximum energy savings and targets low-frequency applications, such as those found in sensor networks.

A management interface to the accelerator store is available over the system bus via memory-mapped operation. This management interface allows the processor to create or remove "handles" that each represent a portion of memory in the accelerator store. A handle table in the accelerator store maintains a list of all active handles and any metadata required for their operation. This metadata includes the head and tail pointers required for queue/FIFO operation, and a few mode flags. This interface is described in a later section.

The processor is responsible for assigning these handles to accelerators. The accelerators then use these handles when accessing the accelerator store.

The accelerator store is able to accept several operation requests per cycle, each sent over an accelerator store port. The accelerator store bus consists of the full set of these accelerator store ports. However, the store may not be able to fulfill requests from all ports during the same cycle. Our design multiplies the internal accelerator store clock frequency by 4x so that we can process four requests per accelerator cycle.

In prior work we implemented a cycle-accurate simulator to examine the operation, performance, and memory energy usage characteristics of the accelerator store design for multiple sensor network applications. We previously found that the accelerator store was able to reduce powered-on memory, simplify accelerator and application design, and reduce memory-related area, with negligible or even beneficial effects on performance. We modified the simulator to emit traces in order to validate our hardware implementation and quantify the energy, power, and timing overheads due to the accelerator store.

System Software Architecture

The accelerator store provides two independent software interfaces: the configuration/management interface, used by the general purpose processor over the system bus, and the accelerator interface, used by the accelerators in the system. System software is responsible for configuring the accelerator store and managing the coordination among accelerators, and between accelerators and the accelerator store. The configuration/management interface allows system software to perform configuration tasks such as setting up the handles in the accelerator store's handle table. It also provides support for management functions that coordinate among accelerators, such as when initializing a producer-consumer model. Additionally, system software coordinates between accelerators and the accelerator store by updating the priority table inside the accelerator store, based on the current system load and
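The C sketch below suggests how system software might use the configuration/management interface to create a FIFO handle and update the priority table. The MMIO base address, register offsets, entry layout, and flag bit are invented for this sketch; the text above specifies only that handles (with head/tail pointers and mode flags) and a priority table are configured over the memory-mapped system bus interface, not the actual memory map.

```c
/*
 * Sketch of the configuration/management interface as seen by system
 * software.  The MMIO base address, register offsets, entry layout,
 * and flag bit below are assumptions made for illustration only.
 */
#include <stdint.h>

#define AS_BASE            0x40000000u          /* assumed MMIO base address    */
#define AS_HANDLE_TABLE    (AS_BASE + 0x000u)   /* assumed handle-table offset  */
#define AS_PRIORITY_TABLE  (AS_BASE + 0x400u)   /* assumed priority-table offset*/

/* Assumed layout of one handle-table entry: base/size describe the
 * memory region, head/tail support queue/FIFO operation, and flags
 * hold the mode bits mentioned above. */
typedef struct {
    volatile uint32_t base;
    volatile uint32_t size;
    volatile uint32_t head;
    volatile uint32_t tail;
    volatile uint32_t flags;
} as_handle_entry_t;

#define AS_FLAG_FIFO  (1u << 0)                 /* assumed FIFO-mode flag bit   */

/* Create a FIFO handle of `words` words starting at `base` in the store. */
static void as_create_fifo_handle(unsigned index, uint32_t base, uint32_t words)
{
    volatile as_handle_entry_t *entry =
        (volatile as_handle_entry_t *)(uintptr_t)AS_HANDLE_TABLE + index;
    entry->base  = base;
    entry->size  = words;
    entry->head  = 0;
    entry->tail  = 0;
    entry->flags = AS_FLAG_FIFO;                /* mark the handle active       */
}

/* Re-prioritize one accelerator store port, e.g. as system load changes. */
static void as_set_port_priority(unsigned port, uint32_t priority)
{
    volatile uint32_t *prio = (volatile uint32_t *)(uintptr_t)AS_PRIORITY_TABLE;
    prio[port] = priority;
}
```

In the actual system, these operations would amount to a handful of memory-mapped stores issued by the general purpose processor before it points the producer and consumer accelerators at the newly created handle.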

