MICROPROCESSOR REPORT

Digital Leads the Pack with 21164
First of Next-Generation RISCs Extends Alpha’s Performance Lead

Vol. 8, No. 12, September 12, 1994
© 1994 MicroDesign Resources

by Linley Gwennap

Three years ago, unbeknownst to its participants, a race began among five major CPU vendors to bring to market the next generation of RISC technology. Each realized that pushing the performance envelope beyond 200 SPECint92 would require aggressive superscalar dispatch and high clock rates. While Hewlett-Packard, IBM, MIPS Technologies, and Sun struggled with more complicated designs, Digital has emerged from the pack to take the checkered flag with its new 21164 design, known internally as EV-5.

The company didn’t skimp on performance to hasten the chip’s debut; not only is the 21164 the first microprocessor to exceed 200 SPECint92, but it should reach a rating of 330 when running at 300 MHz, according to Digital’s estimates, with an astounding SPECfp92 score of 500. These scores more than double those of any non-Alpha microprocessor shipping today, and they should keep Alpha in the performance lead even when other vendors deploy their own next-generation chips.

The design can issue four instructions per cycle into two integer units and two floating-point units, for a peak execution rate of 1.2 BIPS (billion instructions per second).
It is the first microprocessor to include a secondary cache as well as primary caches on chip: the unified level-two (L2) cache is 96K in size, pushing the transistor count to 9.3 million, another first. (Previously, the most transistors on a general-purpose microprocessor were the 3.6 million on the PowerPC 604.) Digital already has working samples, which achieve the performance noted above, and plans to ship the 21164 in 1Q95.

Although the Alpha chip can really fly, it has a few drawbacks. The processor’s vast die size (298 mm²), transistor count, and advanced 0.5-micron IC process push its estimated manufacturing cost to a towering $430, and Digital is quoting a shocking initial price of $2,937 for the 300-MHz part. The 21164 breaks another less desirable record by dissipating nearly 50 W at its peak operating frequency.

Short Cycle Time Requires Two-Level Cache

With its 1,200-MIPS peak throughput, the 21164 requires tremendous instruction and data bandwidth to feed its ravenous engine, far more than could be supplied from external cache RAMs. The design requires a large on-chip cache to buffer the high-bandwidth CPU from the lower-bandwidth external world. With the 0.5-micron process, Digital knew it could push the on-chip cache well beyond the 32K used by most current RISC processors to 64K or even 128K.

But even with Digital’s CMOS-5 process (see 080504.PDF), the design team could not create a large cache array that could return data in a single 3.3-ns clock cycle. In a large array, it simply takes too long for the address to propagate through the array and for the data to propagate back. The best the designers could do was a 16K cache similar to the ones used in the 275-MHz 21064A, which is also built in CMOS-5. This size didn’t work for the 21164, because the data cache had to be dual ported, doubling the die area of the array.
Thus, the new processor includes two primary caches (one for instructions, one for data) of 8K each that can be accessed in a single 3.3-ns cycle.

But the design needed more fast memory on chip than just 16K, leading to the two-level cache scheme. The second-level cache array is 96K in size and requires two cycles (6.7 ns) to access due to its larger physical size. Including cycles for tag access and level-one refill, the total cache-miss penalty for the primary caches is six cycles (20 ns) on an L2 hit. An external L2 cache, in contrast, requires at least 25 ns to service an L1 miss in the 21064. Thus, moving the L2 cache on chip reduces the cache-miss penalty, improving performance.

Putting the second-level cache on the chip has additional benefits. The 21164 uses a three-way set-associative L2 cache, which increases the hit rate compared with the direct-mapped L2 caches used by most processors. It is difficult to implement set-associative caches externally due to the high pin count required, but this difficulty is not an issue for on-chip caches.

The two-level organization allows a more efficient use of resources. The large unified cache offers a higher hit rate than split caches of the same total size. Because the L1 data cache must be dual ported to service two accesses per cycle, the two-level design also avoids the need for a large dual-ported memory, which would have required much more die area; instead, only the small primary data cache must be dual ported.

Finally, incorporating a large cache on chip reduces the need for external cache.
Once the price of the 21164 comes down, it will be feasible to include it in a midrange system with no external cache, reducing system cost. Digital believes that the performance reduction in this configuration should be less than 10% for many applications.

Doubling Instruction Bandwidth

Figure 1 shows a block diagram of the 21164. Instructions are read in groups of four from the instruction cache and are placed into one of two four-word buffers. The dispatcher then issues as many instructions as possible from the current buffer; it must, however, completely empty one buffer before moving on to the next. For example, if three instructions are issued on one cycle, the fourth must be issued by itself on the next cycle.

To avoid this situation, the architects defined a “universal NOP” instruction
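
The buffer-draining rule described in this section can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not Digital's actual issue logic: the function name `issue_schedule`, the `"int"`/`"fp"` labels, and the strictly in-order "stop at the first instruction without a free unit" policy are all hypothetical. Only the four-word buffer size, the two integer and two floating-point units, and the rule that one buffer must empty before the next is used come from the article.

```python
# Two integer units and two floating-point units, as stated in the article.
UNITS_PER_CYCLE = {"int": 2, "fp": 2}


def issue_schedule(instructions, buffer_size=4):
    """Return a list of per-cycle issue groups.

    Simplified model (an assumption, not the real 21164 rules):
    instructions issue in program order, each one needs a free unit
    of its kind, and nothing from the next buffer may issue until
    the current buffer is completely empty.
    """
    cycles = []
    for start in range(0, len(instructions), buffer_size):
        buffer = list(instructions[start:start + buffer_size])
        while buffer:
            free = dict(UNITS_PER_CYCLE)  # all units are free each cycle
            group = []
            # Issue in order until an instruction finds no free unit.
            while buffer and free[buffer[0]] > 0:
                kind = buffer.pop(0)
                free[kind] -= 1
                group.append(kind)
            cycles.append(group)
    return cycles


stream = ["int", "fp", "int", "int",   # first four-word buffer
          "int", "fp", "fp", "int"]    # second four-word buffer
for cycle, group in enumerate(issue_schedule(stream), start=1):
    print(cycle, group)
# 1 ['int', 'fp', 'int']
# 2 ['int']
# 3 ['int', 'fp', 'fp', 'int']
```

In this model the fourth instruction of the first buffer issues alone in cycle 2, even though instructions in the second buffer could have used the idle units, which mirrors the three-then-one example in the text.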