UT EE 382C - Cache Justification for Digital Signal Processors


Cache Justification for Digital Signal Processors

by Michael J. Lee
December 3, 1999

ABSTRACT

Caches are commonly used on general-purpose processors (GPPs) to improve performance by reducing the need to go to off-chip memory every time a program instruction or data word is needed. Digital signal processors (DSPs), however, traditionally did not incorporate any caches, relying instead on fast on-chip memory banks. Although some DSPs made use of small instruction caches, none had integrated a data cache. This paper discusses the justification for having caches on DSPs and the performance impact of using them. It mainly analyzes Texas Instruments' TMS320C6211, which uses a two-level caching scheme for both instructions and data. The paper concludes with my findings on cache utilization of the TMS320C6211, obtained by evaluating several commonly used DSP kernels on a commercially available simulator.

INTRODUCTION

More and more consumer products and cost-sensitive systems are incorporating DSPs into their designs. The demand for products with signal processing capabilities is influencing the design goals of some DSPs. Traditionally, a clear distinction set DSPs apart from GPPs; but now, as DSPs grow more sophisticated, that distinction is becoming more ambiguous. One area that used to set DSPs apart from GPPs was the use of caches. GPPs routinely implemented caches in their designs, while DSPs have traditionally lacked caches, especially data caches. As processor cycle times continue to fall, a growing proportion of peak processing power is being lost to the memory system. Caches offer a simple and effective way to combat the latency of accessing instructions and data [1].

MEMORY ARCHITECTURES

Typical DSP operations require high memory bandwidth because they are data-intensive.
To satisfy these high throughput requirements, DSPs usually implement the Harvard architecture as opposed to the von Neumann architecture. GPPs have traditionally used the von Neumann memory architecture (Figure 1a), which connects one memory space to the processor core by one bus set: an address bus and a data bus. This memory bandwidth was sufficient to keep many GPP applications well supplied with instructions and data. However, the memory bandwidth requirements of DSPs make the von Neumann architecture a poor choice. Thus, most DSPs implement a Harvard memory architecture (Figure 1b).

The Harvard memory architecture uses two memory spaces, usually partitioned as program memory and data memory. The two memory spaces are connected to the processor core by two bus sets, allowing two simultaneous accesses to memory. This, in effect, doubles the processor's memory bandwidth and thus keeps the processor core well fed with instructions and data [2]. Modern high-performance GPPs such as the Pentium and PowerPC also implement a similar Harvard architecture to make multiple memory accesses per instruction cycle, supporting superscalar execution and high data demands [3].

Figure 1: Memory architectures. (a) von Neumann (b) Harvard [3]

GPP AND DSP CACHE NEEDS

Most high-performance GPPs contain two on-chip caches: one for data and one for instructions. The performance requirements of GPPs make quick memory access an essential criterion for meeting their design goals, but the high-performance nature of these chips often makes it impossible to integrate memory chips capable of keeping pace with the processors. Therefore, caches provide a viable way of sustaining data demands while allowing data retrieval at full processor speed without accessing slow, off-chip memory. The importance of quick memory access can be seen in IBM's highly anticipated Power4 chip, which allocates 60 percent of its transistors to cache memory [4].

Unlike GPPs, most DSP processors do not have any cache. Instead, they rely on multiple banks of on-chip memory and multiple bus sets to allow several on-chip memory accesses per instruction cycle. However, some DSPs do incorporate a small, specialized instruction cache that stores the instructions of small loops, so that the on-chip bus sets are free to retrieve data words. DSP processors almost never include a data cache because the data is typically "streaming": data samples are used by the DSP to perform computations and are then discarded, with little need for reuse [3].

The traditional DSP design changed with the introduction of Texas Instruments' TMS320C6211, which not only includes instruction and data caches but also implements them in two levels. At 150 MHz, it is capable of 1200 RISC MIPS, executing up to eight instructions per cycle. It has 72 KB of on-chip RAM: 4 KB of L1 program cache, 4 KB of L1 data cache, and 64 KB of L2 unified cache [4]. The C6211 is expected to be used in price-sensitive applications such as digital subscriber loop (DSL) clients for small offices and homes, high-speed data transmission functions in switches and routers, wireless data clients, imaging, biometrics, remote medical diagnostics, automotive vehicle and drivetrain control, and security systems [6].

TMS320C6211

The TMS320C6211 utilizes a two-level memory architecture for on-chip program and data accesses (Figure 2). The first level consists of 4 KB of direct-mapped program cache (L1P) and 4 KB of 2-way set-associative data cache (L1D). Separate, dedicated L1 caches prevent the conflicts that could arise from contention for memory resources between the program and data busses. A direct-mapped organization is well suited to the L1P because DSP algorithms consist of small, tight loops.
Set associativity is more appropriate for the L1D because data accesses tend to be more random and exhibit more strides than program fetches [8]. The L1D also uses a least-recently-used (LRU) replacement scheme, which produces cache allocations very close to optimal [9]. The second level consists of 64 KB that can be used for both program and data. It can serve entirely as cache, be mapped directly as internal memory, or be used as a combination of the two. It is divided into four 16 KB banks, each of which can be programmed as cache or RAM space. With this flexibility, the user can partition the L2 optimally into a balance of RAM, program cache, and data cache [7].

Figure 2: TMS320C6211 Digital Signal Processor

