Alpha and IA64
October 11, 1999

Executive Summary

Applications have two types of parallelism: instruction-level parallelism and thread-level parallelism. Instruction-level parallelism enables a processor to issue multiple instructions in the same cycle. Instruction-level parallelism can be static (discovered by the compiler at compile-time) or dynamic (discovered by the processor at run-time). Thread-level parallelism enables a processor to run multiple threads, processes, or programs at the same time.

An Alpha processor will exploit static and dynamic instruction-level parallelism with out-of-order execution, and thread-level parallelism with simultaneous multithreading. Out-of-order execution has a performance benefit of 1.5-2x over in-order execution. Simultaneous multithreading has a benefit of 1.5-3x over single-threaded execution.

An IA64 processor will only exploit static instruction-level parallelism. It cannot take advantage of dynamic instruction-level parallelism or thread-level parallelism. IA64 defines a set of architectural extensions to permit compilers to identify more instruction-level parallelism. These architectural extensions will make it very difficult for an IA64 processor to implement out-of-order execution or simultaneous multithreading efficiently. For most applications, the small benefit that these architectural extensions give compilers does not equal the performance lost by not using these dynamic techniques.

Alpha will be superior to IA64 on commercial applications. Commercial applications are very sensitive to code size. The IA64 instruction encoding increases the code size of a program by at least 33%, and the compiler techniques required by the IA64 introduce many additional instructions. Commercial programs are difficult to analyze at compile-time, and IA64 cannot dynamically adjust to program behavior at run-time. Commercial programs have very low instruction-level parallelism, but they are typically explicitly multithreaded.
Each thread is very sequential and includes long delays waiting for memory. The IA64 strategy of searching for instruction-level parallelism cannot find the orders of magnitude improvements available to Alpha through simultaneous multithreading.

Alpha will be superior to IA64 in high performance technical computing. Memory bandwidth and the scalability of the system limit the performance of most high performance technical applications. Future Alpha processors are adding a low-latency, high-bandwidth memory interface on chip, together with on-chip support for distributed shared memory. The next generation Alpha processors will have the fastest memory system in the industry. Alpha will be the leader in high performance technical computing.

1. Introduction

Future Alpha processors will be developed around two architectural concepts: out-of-order execution and simultaneous multithreading.

• Out-of-order execution enables the processor to schedule the execution of instructions in an order that maximizes program performance. It has a proven benefit of 1.5-2x over in-order execution.
• Simultaneous multithreading (SMT) enables multiple threads (or processes) to run simultaneously on a single microprocessor. Most server applications are divided into multiple threads, and SMT permits these applications to take full advantage of the multiple execution units on the processor. SMT has a benefit of 1.5-3x over single-threaded execution.

These two features permit an Alpha processor to exploit both thread-level parallelism and instruction-level parallelism. The processor can use these two types of parallelism interchangeably, and dynamically adapt to the varying requirements of the application.

Intel has chosen a markedly different direction than Alpha. Intel is introducing a new 64-bit instruction set architecture called IA64.
They have called the architecture EPIC, for Explicitly Parallel Instruction Computing, but it is essentially a VLIW (Very Long Instruction Word) architecture. The IA64 architecture is very similar to the machine built by Cydrome, a failed minisupercomputer company of the 1980s. The first implementation of IA64 is called Merced, with a follow-on implementation called McKinley.

With the IA64, Intel is focusing on a compiler-driven technology to increase instruction-level parallelism, and is ignoring other proven ways to improve performance on large applications. IA64 is developed for an in-order execution model, with a set of new architectural extensions to permit compilers to identify more instruction-level parallelism. These architectural extensions will make it very difficult for IA64 processors to implement out-of-order execution or simultaneous multithreading efficiently. For most applications, the small benefit that these architectural extensions give compilers does not equal the performance lost by not using these dynamic techniques.

2. Design Philosophy

IA64: a smart compiler and a dumb machine

The IA64 design is a derivative of the VLIW machines designed by Multiflow and Cydrome in the 1980s. The key idea is a generalization of horizontal microcode: in a wide instruction word the processor presents control of all of the functional units to the compiler, and the compiler precisely schedules where every operation, every register file read, every bypass, will occur. In effect, the compiler creates a record of execution for the program, and the machine plays that record. In the early VLIWs, if the compiler made a mistake, the machine generated the wrong results; the machine had no logic to check that registers were read in the correct order or if resources were oversubscribed.
In more modern machines such as the IA64 processors, the machine will run slowly (but correctly) when the compiler is wrong.

The IA64 design requires the compiler to predict at compile-time how a program will behave. Traditionally, VLIW-style machines have been built without caches and focused on loop-intensive, vectorizable code. These restrictions mean the memory latency is fixed and branch behavior is very predictable at compile-time. However, IA64 will be implemented as a general-purpose processor, with a data cache, running a wide variety of applications. In most applications, the latency of a memory operation is very difficult to predict; a cache miss may have a latency that is 100 times longer than a cache hit. Alpha’s out-of-order design can dynamically adjust to the cache pattern of the
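
The effect of unpredictable memory latency on a fixed instruction order can be illustrated with a toy model. The sketch below is not a model of any Alpha or IA64 implementation: the names (Instr, in_order_cycles, out_of_order_cycles), the one-instruction-per-cycle issue limit, the stall-on-use behavior, and the 1-cycle hit latency are all assumptions made for illustration. Only the 100x ratio between miss and hit latency is taken from the text above.

```python
from dataclasses import dataclass

HIT_LATENCY = 1                    # cycles; illustrative assumption
MISS_LATENCY = 100 * HIT_LATENCY   # the "100 times longer" ratio from the text


@dataclass(frozen=True)
class Instr:
    dest: str          # register written (assumed unique per instruction)
    srcs: tuple = ()   # registers read
    latency: int = 1   # cycles until the result is available


def in_order_cycles(program):
    """Issue at most one instruction per cycle, strictly in program order.
    An instruction whose operands are not ready stalls everything behind it."""
    ready = {}        # register -> cycle at which its value is available
    next_issue = 0    # earliest cycle at which the next instruction may issue
    finish = 0
    for ins in program:
        issue = max([next_issue] + [ready.get(r, 0) for r in ins.srcs])
        ready[ins.dest] = issue + ins.latency
        finish = max(finish, ready[ins.dest])
        next_issue = issue + 1
    return finish


def out_of_order_cycles(program):
    """Issue at most one instruction per cycle, picking the oldest instruction
    whose operands are ready (idealized: unlimited instruction window)."""
    produced = {ins.dest for ins in program}
    ready = {}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        for ins in pending:
            # A source is ready if no instruction produces it (initial value)
            # or if its producer's result is available by this cycle.
            if all(r not in produced or ready.get(r, float("inf")) <= cycle
                   for r in ins.srcs):
                ready[ins.dest] = cycle + ins.latency
                finish = max(finish, ready[ins.dest])
                pending.remove(ins)
                break
        cycle += 1
    return finish


def make_program(load_latency):
    """Two independent load/use pairs in the order load, use, load, use."""
    return [
        Instr("r1", (), load_latency),   # load
        Instr("r2", ("r1",), 1),         # use of first load
        Instr("r3", (), load_latency),   # independent load
        Instr("r4", ("r3",), 1),         # use of second load
    ]


all_hits = make_program(HIT_LATENCY)
all_misses = make_program(MISS_LATENCY)

print(in_order_cycles(all_hits), out_of_order_cycles(all_hits))       # 4 4
print(in_order_cycles(all_misses), out_of_order_cycles(all_misses))   # 202 102
```

When both loads hit, the fixed order costs nothing in this model. When both loads miss, the in-order machine serializes the two misses, while the dynamically scheduled machine starts the second load during the first miss and overlaps them.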

