UW-Madison ECE/CS 752 - Emulating Unimplemented Instructions in a Simultaneous Multithreaded Processor

Emulating Unimplemented Instructions in a Simultaneous Multithreaded Processor

Suan Yong and Brian Forney
CS/ECE 752
Spring 2000

Abstract

Emulating unimplemented instructions can reduce the cost and power requirements of a processor by allowing functional units to be removed. But the handling of unimplemented instruction exceptions in modern processors wastes fetch bandwidth and reduces throughput due to squashed instructions. Simultaneous Multithreaded (SMT) processors can avoid the waste by using multiple thread contexts to handle unimplemented instruction exceptions. We investigate the effectiveness of SMT in improving the performance of emulating unimplemented instructions. We focus on emulation of the Alpha 21164’s integer multiply instructions as a proof of concept.

1 Introduction

Simultaneous Multithreaded (SMT) processors exploit thread-level parallelism for greater performance. Hardware thread contexts are used to fetch from multiple threads concurrently. This keeps the processor busier than fetching from a single thread. Research in SMT processors has shown promising performance improvement (about 2.5 times, according to [Tullsen]). Industry also appears to be embracing SMT. Multithreaded processors, which are a precursor to SMT processors, have already begun shipping [Storino, Alverson]. SMT processors are under development, with the Alpha 21464 [Lipasti] as one example of a planned SMT processor.

SMT processors, however, require additional hardware beyond what a superscalar processor demands. SMT processors require a larger register file, multiple program counters, separate instruction retirement and exception handling, and modified branch prediction. Other structures, like a return address stack, may also need to be extended [Tullsen]. Additional components add to the complexity of a processor, often leading to increased processor cost, power, and development cost.

Cost and power considerations are important for mobile and embedded computing.
Mobile computers have limited power resources, which typically are less than users would like. Users of mobile systems would like these systems to behave like desktop computers. Embedded computing, like mobile computing, has cost and power constraints. Embedded systems function in a variety of applications, including automobiles, PDAs, airplanes, and data acquisition devices.

SMT processors’ additional complexity limits their use for applications like mobile and embedded computing, where cost and power are important. One solution is to remove functional units from the processor. However, partial emulation of the ISA will be needed to ensure binary compatibility. Partial emulation is implemented as an unimplemented instruction exception. Exception handling via trapping requires instructions fetched after the excepting instruction to be squashed. For out-of-order and SMT processors, this means almost all instructions in the instruction window are squashed since exceptions are not handled until commit time.

Previous work [Zilles] on exception handling demonstrated speed-ups for multithreaded processors. The key technique from this work is exploiting hardware contexts to switch between a main thread and an exception handling thread. We propose an extension of this work for handling unimplemented instructions. This extension decreases the performance tradeoff made by removing functional units and emulating their behavior with software. A small amount of additional hardware is needed, but the addition is smaller than the removed functional units, and is fixed regardless of the number of functional units removed.

This paper is organized as follows. Section 2 discusses related work. We explain our technique for improved emulation of unimplemented instructions in section 3. Section 4 contains our simulation and benchmark selection methodology. Experimental results are presented in section 5. We suggest future work in section 6.
The paper concludes in section 7.

2 Related work

Our work builds on previous work in several areas. Zilles et al. [Zilles] at the University of Wisconsin and Compaq studied exception handling in multithreaded processors. This work proposed the use of hardware contexts to handle exceptions without squashing fetched instructions. When a processor generates an exception, the thread that created the exception is paused, and a new hardware thread is allowed to handle the exception. The hardware context switch conserves instructions fetched for the excepting thread after the excepting instruction. Zilles et al. used TLB miss handlers as a case study of this technique. Our work is most closely related to this work.

Work at the University of Michigan by Chappell et al. [Chappell] proposed the use of subordinate threads in a multithreaded processor to help a master thread. The subordinate threads provide services to the master thread. Services include cache prefetching and branch prediction, allowing software greater control over program behavior than hardware-based mechanisms. Threads in this model are started by a special SPAWN instruction injected in the binary by the compilation system. Unlike Zilles’ work, subordinate threads fetch from a separate memory called a microRAM maintained by the operating system. Our emulation threads are fundamentally different from these subordinate threads in that our emulation threads need to have a higher priority than the main thread, since progress of the main thread is dependent on the completion of the emulation thread. A further difference is that emulation threads are started by processor exceptions rather than compiler-inserted SPAWN instructions.

Various emulation techniques exist. The techniques can be divided into two categories: full emulation of an ISA, and emulation of unimplemented instructions, or partial emulation of an ISA.
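
To make the second category concrete, the following is a minimal sketch, not the handler evaluated in this paper, of how software might emulate an integer multiply using only shifts, adds, and masks once the hardware multiplier has been removed. The names emulate_mulq, emulate_umulh, and MASK64 are hypothetical. They model the low and high 64 bits of an unsigned 64-bit product, in the manner of the Alpha MULQ and UMULH instructions, and the shift-and-add loop is an assumed algorithm chosen for clarity rather than speed.

```python
# Hypothetical sketch: software emulation of a 64-bit integer multiply.
# Assumption: registers are modeled as unsigned 64-bit values.
MASK64 = (1 << 64) - 1


def emulate_mulq(a: int, b: int) -> int:
    """Return the low 64 bits of a * b using shift-and-add."""
    a &= MASK64
    b &= MASK64
    product = 0
    while b:
        if b & 1:
            # This multiplier bit is set, so add the shifted multiplicand.
            product = (product + a) & MASK64
        a = (a << 1) & MASK64  # multiplicand for the next bit position
        b >>= 1
    return product


def emulate_umulh(a: int, b: int) -> int:
    """Return the high 64 bits of the unsigned 128-bit product a * b."""
    a &= MASK64
    b &= MASK64
    product = 0  # accumulates the full 128-bit result
    shift = 0
    while b:
        if b & 1:
            product += a << shift
        b >>= 1
        shift += 1
    return (product >> 64) & MASK64
```

A handler built along these lines would read the two source registers of the excepting instruction, call the matching routine, and write the result to the destination register before the main thread resumes. Each loop runs once per bit of the multiplier up to its highest set bit, which is one reason the cost of entering and leaving the handler matters.
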
Complete emulation of ISAs has been used in various systems including the IBM/360 to run IBM 1401, 7070, and 7090 applications, the Digital VAX for Digital PDP-11 binary compatibility, the Apple PowerMac, HP’s PA-RISC line of workstations, and Compaq’s Alpha systems using FX!32 [Altman].

Binary translation of ISAs is a subarea of general complete ISA emulation and has received increased interest recently. Static and dynamic translators, like HP’s Aries [Zheng] for PA-RISC to IA-64 translation, and Java VMs, are examples. These systems create a new binary either off-line or during program execution. However, unlike interpreters, the translations are cached temporarily or permanently.

Partial translation of an ISA traditionally

