MICROPROCESSOR REPORT
UltraSparc Unleashes SPARC Performance
Vol. 8, No. 13, October 3, 1994 © 1994 MicroDesign Resources
by Linley Gwennap

High-end SPARC performance, languishing at sub-Pentium levels, is set to receive a big boost next year when UltraSparc debuts. Sun expects this next-generation RISC chip to triple the performance of a 60-MHz SuperSparc, moving SPARC from the back of the pack to within hailing distance of the lead. The key to this incredible increase is a complete redesign of the processor pipeline to eliminate the constrictions of the SuperSparc design. The result: a projected clock speed of 167 MHz, a huge jump for Sun and a respectable rate compared with other next-generation RISC chips.

Unlike Digital, which has already measured the performance of the 21164, Sun’s performance estimates are conjecture, as UltraSparc has not yet seen first silicon. Sun has built test chips to verify the speed of its design and has performed extensive timing simulations, hoping to avoid the embarrassment of its SuperSparc launch. The design avoids SuperSparc’s fatal flaws (the double-pumped register file and TLB), but it remains to be seen whether Sun can deliver on its promises and turn a paper tiger into a real man-eater.

The first announced processor to implement the SPARC version 9 architecture (see 070201.PDF), UltraSparc is a full 64-bit design.
It can issue as many as four instructions per cycle to nine function units: two integer ALUs, one load/store unit, one branch unit, and five special-purpose units for floating-point and graphics calculations. The chip has moderate on-chip caches for a processor of its generation: 16K for instructions and 16K for data, less than SuperSparc. To make up for these modest caches, UltraSparc connects directly to a synchronous external cache that can return one result per cycle. In addition to SPARC V9, the design implements a unique set of graphics and multimedia instructions.

Sun has not announced price or availability for the new processor, which will be fabricated by Texas Instruments. We expect UltraSparc to begin shipping in volume in 3Q95, six to nine months later than the 21164.

Flexible Instruction Alignment

Sun, with the largest installed base of any RISC system vendor, has always been concerned about the performance of existing (unrecompiled) binaries on new processors. UltraSparc implements a simple scheme that avoids the instruction-alignment restrictions that prevent the 21164 and other highly superscalar processors from achieving maximum performance without recompilation. The SPARC chip fetches instructions into a 12-entry FIFO buffer; the instruction dispatcher simply issues up to four instructions from the bottom of the buffer.

This scheme works well as long as the buffer is kept reasonably full. For starters, the instruction cache can deliver four instructions (128 bits) per cycle to the buffer, but branches can disrupt this flow. To counter this problem, the cache includes a “next” field that can redirect the fetch stream if the current instruction group contains a predicted-taken branch. For cache lines that do not contain such branches, this field contains the next sequential address.
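
As a rough illustration of the “next” field, the following Python sketch models a fetch loop in which every four-instruction group stores the address of the group to fetch after it. The names (CacheLine, next_addr, make_line, fetch_stream), the 16-byte group size, and the dictionary standing in for the cache are assumptions made for this example, not details of Sun’s design, and mispredictions are not modeled.

```python
from dataclasses import dataclass

GROUP_BYTES = 16  # four 32-bit instructions (128 bits) per fetch


@dataclass
class CacheLine:
    addr: int       # address of the first instruction in the group
    next_addr: int  # hypothetical "next" field: where to fetch afterwards


def make_line(addr, predicted_taken_target=None):
    """Without a predicted-taken branch, "next" is simply sequential."""
    if predicted_taken_target is None:
        return CacheLine(addr, addr + GROUP_BYTES)
    return CacheLine(addr, predicted_taken_target)


def fetch_stream(icache, start, count):
    """Follow the "next" fields for `count` fetches; return the addresses."""
    addrs, pc = [], start
    for _ in range(count):
        addrs.append(pc)
        pc = icache[pc].next_addr  # the redirect is already stored, so no bubble
    return addrs


# Toy cache: the group at 0x10 holds a predicted-taken branch to 0x40.
icache = {
    0x00: make_line(0x00),
    0x10: make_line(0x10, predicted_taken_target=0x40),
    0x40: make_line(0x40),
    0x50: make_line(0x50),
}
print([hex(a) for a in fetch_stream(icache, 0x00, 4)])
```

In this toy model the fetch after 0x10 goes straight to 0x40 without an idle cycle, because the redirect was computed when the group was loaded rather than when the branch was fetched.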
The contents of this field direct the next instruction fetch, eliminating any penalty for correctly predicted taken branches.

As they are loaded into the cache, instructions are partially decoded to determine if they contain a branch and, if so, what the target address is. This information is used to initialize the “next” field. In what is becoming a common superscalar design technique, the instruction cache stores four bits of decode information with each instruction as well as two bits of branch history per cache line. Sun’s simulations show an 88% prediction accuracy on SPECint92 using these two history bits.

As Figure 1 shows, instructions are further decoded before being placed in the instruction buffer. Each entry in the buffer is 62 bits wide to contain all the decode information. This extensive information allows the dispatch unit to quickly decide which instructions can be issued and even allows time for a register file access, all in a single clock cycle.

Instructions are always issued in order; if an instruction cannot be issued due to a resource conflict or a register dependency, no subsequent instructions are issued on that cycle. Unlike SuperSparc, the new design does not cascade the ALUs; this change prevents dependent integer instructions from being paired but helps support the high clock rate. One special case is that a store can be dispatched in the same cycle as the instruction that calculates the store data; this case is handled by forwarding the result to the store queue.

There is one flaw that breaks the “no alignment” strategy.
The first three instructions can be dispatched to any function unit, but the fourth can be sent to only the branch or floating-point units. Sun says that allowing the fourth slot to contain a general integer instruction would have greatly increased the amount of dependency checking but added little performance. Restricting the fourth slot also reduces the number of ports in the integer register file.

Long Pipeline Includes FPU

UltraSparc uses a nine-stage pipeline, as Figure 2 shows. The basic integer pipeline is actually six stages, two more than in SuperSparc; the additional stages at the back end support the floating-point and graphics units.

The first two stages perform instruction fetch and decode. As noted above, the decoded instructions are placed in the instruction buffer. If the buffer is not empty (the typical situation), instructions may wait one or more cycles before being dispatched to the function units in the G (grouping) stage. The next two stages are the classic RISC
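
The dispatch rules described under Flexible Instruction Alignment can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the names (Instr, form_group, CAPACITY) are invented for the example, the five floating-point/graphics units are treated as one interchangeable pool, and only dependencies between instructions in the same group are checked. It models in-order issue of up to four instructions, the stop-at-first-conflict rule, the store-data exception, and the restricted fourth slot.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Units available per cycle, by class (counts from the article; pooling the
# five FP/graphics units is a simplification made for this sketch).
CAPACITY = {"alu": 2, "mem": 1, "branch": 1, "fp": 5}
MAX_GROUP = 4


@dataclass
class Instr:
    name: str
    unit: str                         # "alu", "mem", "branch", or "fp"
    dest: Optional[str] = None        # register written, if any
    srcs: tuple = ()                  # registers read (excluding store data)
    store_data: Optional[str] = None  # stores only: register holding the data


def form_group(buffer):
    """Pop and return the instructions dispatched this cycle, in order."""
    group, used, written = [], {u: 0 for u in CAPACITY}, set()
    while buffer and len(group) < MAX_GROUP:
        ins = buffer[0]
        # The fourth slot accepts only branch or floating-point instructions.
        if len(group) == 3 and ins.unit not in ("branch", "fp"):
            break
        # Resource conflict: no free unit of the required class.
        if used[ins.unit] >= CAPACITY[ins.unit]:
            break
        # Dependency on an earlier instruction in this group (ALUs are not
        # cascaded). store_data is not checked: that value is forwarded.
        if any(r in written for r in ins.srcs):
            break
        group.append(buffer.popleft())
        used[ins.unit] += 1
        if ins.dest is not None:
            written.add(ins.dest)
    return group


buf = deque([
    Instr("add", "alu", dest="r3", srcs=("r1", "r2")),
    Instr("st", "mem", srcs=("r4",), store_data="r3"),  # stores add's result
    Instr("fadd", "fp", dest="f4", srcs=("f0", "f2")),
    Instr("sub", "alu", dest="r7", srcs=("r5", "r6")),  # integer op in slot 4
    Instr("or", "alu", dest="r8", srcs=("r7", "r1")),   # depends on sub
])
groups = []
while buf:
    groups.append([i.name for i in form_group(buf)])
print(groups)
```

In this example the store issues alongside the add that produces its data, the sub is held back because it would occupy the fourth slot, and the or waits a further cycle because it depends on the sub.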