MICROPROCESSOR REPORT
THE INSIDERS' GUIDE TO MICROPROCESSOR HARDWARE
VOLUME 8, NUMBER 13, OCTOBER 3, 1994
© 1994 MicroDesign Resources

UltraSparc Unleashes SPARC Performance
Next-Generation Design Could Put Sun Back in Race

by Linley Gwennap

High-end SPARC performance, languishing at sub-Pentium levels, is set to receive a big boost next year when UltraSparc debuts. Sun expects this next-generation RISC chip to triple the performance of a 60-MHz SuperSparc, moving SPARC from the back of the pack to within hailing distance of the lead.

The key to this incredible increase is a complete redesign of the processor pipeline to eliminate the constrictions of the SuperSparc design. The result: a projected clock speed of 167 MHz, a huge jump for Sun and a respectable rate compared with other next-generation RISC chips.

Unlike Digital, which has already measured the performance of the 21164, Sun's performance estimates are conjecture, as UltraSparc has not yet seen first silicon. Sun has built test chips to verify the speed of its design and has performed extensive timing simulations, hoping to avoid the embarrassment of its SuperSparc launch. The design avoids SuperSparc's fatal flaws (the double-pumped register file and TLB), but it remains to be seen whether Sun can deliver on its promises and turn a paper tiger into a real man-eater.

The first announced processor to implement the SPARC version 9 architecture (see 070201.PDF), UltraSparc is a full 64-bit design. It can issue as many as four instructions per cycle to nine function units: two integer ALUs, one load/store unit, one branch unit, and five special-purpose units for floating-point and graphics calculations. The chip has moderate on-chip caches for a processor of its generation: 16K for instructions and 16K for data, less than SuperSparc. To make up for these modest caches, UltraSparc connects directly to a synchronous external cache that can return one result per cycle.

In addition to SPARC V9, the design implements a unique set of graphics and multimedia instructions. Sun has not announced price or availability for the new processor, which will be fabricated by Texas Instruments. We expect UltraSparc to begin shipping in volume in 3Q95, six to nine months later than the 21164.

Flexible Instruction Alignment

Sun, with the largest installed base of any RISC system vendor, has always been concerned about the performance of existing, unrecompiled binaries on new processors. UltraSparc implements a simple scheme that avoids the instruction-alignment restrictions that prevent the 21164 and other highly superscalar processors from achieving maximum performance without recompilation.

The SPARC chip fetches instructions into a 12-entry FIFO buffer; the instruction dispatcher simply issues up to four instructions from the bottom of the buffer. This scheme works well as long as the buffer is kept reasonably full.

For starters, the instruction cache can deliver four instructions (128 bits) per cycle to the buffer, but branches can disrupt this flow. To counter this problem, the cache includes a "next" field that can redirect the fetch stream if the current instruction group contains a predicted-taken branch. For cache lines that do not contain such branches, this field contains the next sequential address. The contents of this field direct the next instruction fetch, eliminating any penalty for correctly predicted taken branches.

As they are loaded into the cache, instructions are partially decoded to determine if they contain a branch and, if so, what the target address is. This information is used to initialize the "next" field. In what is becoming a common superscalar design technique, the instruction cache stores four bits of decode information with each instruction, as well as two bits of branch history per cache line. Sun's simulations show an 88% prediction accuracy on SPECint92 using these two history bits.

As Figure 1 shows, instructions are further decoded before being placed in the instruction buffer. Each entry in the buffer is 62 bits wide to contain all the decode information. This extensive information allows the dispatch unit to quickly decide which instructions can be issued and even allows time for a register-file access, all in a single clock cycle.

[Figure 1: block diagram not recovered. Surviving labels: Instr TLB (64 entries); Instruction Cache (16K plus next field, branch hist); Predecode Unit; Virtual Address; 128 bits; 144.]

Instructions are always issued in order; if an instruction cannot be issued due to a resource conflict or a register dependency, no subsequent instructions are issued on that cycle. Unlike SuperSparc, the new design does not cascade the ALUs; this change prevents dependent integer instructions from being paired but helps support the high clock rate. One special case is that a store can be dispatched in the same cycle as the instruction that calculates the store data; this case is handled by forwarding the result to the store queue.

There is one flaw that breaks the no-alignment strategy. The first three instructions can be dispatched to any function unit, but the fourth can be sent to only the branch or floating-point units. Sun says that allowing the fourth slot to contain a general integer instruction would have greatly increased the amount of dependency checking but added little performance. Restricting the fourth slot also reduces the number of ports in the integer register file.

Long Pipeline Includes FPU

UltraSparc uses a nine-stage pipeline, as Figure 2 shows. The basic integer pipeline is actually six stages, two more than in SuperSparc; the additional stages at the back end support the floating-point and graphics units.

The first two stages perform instruction fetch and decode. As noted above, the decoded instructions are placed in the instruction buffer. If the buffer is not empty (the typical situation), instructions may wait one or more cycles before being dispatched to the function units in the G (grouping) stage. The next two stages are the classic RISC execute and cache-access stages.

Instead of completing with a writeback in the sixth stage, three additional stages are added to wait for long-latency FP and graphics operations. These stages make it easier to resolve FP traps. The completion units hold results until they are written to the register file, reducing the amount of bypassing needed for the long pipeline.

For floating-point and graphics instructions, the fourth (E) stage is used for additional decoding and for accessing the FP register file.
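
The in-order grouping rules described under Flexible Instruction Alignment can be made concrete with a toy model. The sketch below is not Sun's dispatch logic; it is a minimal illustration under stated assumptions. The instruction classes, the `Instr` record, and the `form_group` function are hypothetical names invented here. The unit counts come from the article, but treating the five FP/graphics units as interchangeable and checking only read-after-write dependencies are simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical instruction classes (the article gives no encoding).
INT, LOAD, STORE, BRANCH, FP = "int", "load", "store", "branch", "fp"

# Function-unit counts as listed in the article. Lumping the five
# FP/graphics units into one pool is a simplifying assumption.
UNITS = {"alu": 2, "mem": 1, "branch": 1, "fp": 5}
UNIT_OF = {INT: "alu", LOAD: "mem", STORE: "mem", BRANCH: "branch", FP: "fp"}


@dataclass
class Instr:
    kind: str
    dest: Optional[str] = None        # register written, if any
    srcs: Tuple[str, ...] = ()        # registers read (address regs for a store)
    store_data: Optional[str] = None  # register supplying a store's data


def form_group(buffer: List[Instr]) -> List[Instr]:
    """Return the instructions issued this cycle from the bottom of the FIFO."""
    group: List[Instr] = []
    written = set()                   # registers written earlier in this group
    used = {unit: 0 for unit in UNITS}
    for slot, ins in enumerate(buffer[:4]):        # at most four per cycle
        unit = UNIT_OF[ins.kind]
        if slot == 3 and ins.kind not in (BRANCH, FP):
            break                     # fourth slot: branch or FP only
        if used[unit] >= UNITS[unit]:
            break                     # resource conflict
        if any(reg in written for reg in ins.srcs):
            break                     # dependent on an earlier group member
        # store_data is deliberately not checked: the article says the
        # result is forwarded to the store queue in this special case.
        group.append(ins)
        used[unit] += 1
        if ins.dest is not None:
            written.add(ins.dest)
    return group                      # in order: stops at the first blocked slot


if __name__ == "__main__":
    fifo = [
        Instr(INT, dest="r1", srcs=("r2", "r3")),
        Instr(STORE, srcs=("r6",), store_data="r1"),
    ]
    print([i.kind for i in form_group(fifo)])  # prints ['int', 'store']
```

Because the loop breaks at the first instruction that cannot issue, a blocked instruction also holds back everything behind it, which is the in-order behavior the article describes. A dependent integer pair therefore splits across two cycles, while the store in the example issues alongside the instruction that produces its data.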