
VERY LONG INSTRUCTION WORD ARCHITECTURES AND THE ELI-512

Joseph A. Fisher
Yale University, New Haven, Connecticut 06520

This research is sponsored in part by the National Science Foundation under grants MCS-81-08181 and MCS-81-07846, in part by the Office of Naval Research under grant number N00014-82-K-0184, and in part by the Army Research Office under grant number DAAG29-81-K-0171.

ABSTRACT

By compiling ordinary scientific applications programs with a radical technique called trace scheduling, we are generating code for a parallel machine that will run these programs faster than an equivalent sequential machine - we expect 10 to 30 times faster.

Trace scheduling generates code for machines called Very Long Instruction Word architectures. In Very Long Instruction Word machines, many statically scheduled, tightly coupled, fine-grained operations execute in parallel within a single instruction stream. VLIWs are more parallel extensions of several current architectures.

These current architectures have never cracked a fundamental barrier. The speedup they get from parallelism is never more than a factor of 2 to 3. Not that we couldn't build more parallel machines of this type; but until trace scheduling we didn't know how to generate code for them. Trace scheduling finds sufficient parallelism in ordinary code to justify thinking about a highly parallel VLIW.

At Yale we are actually building one. Our machine, the ELI-512, has a horizontal instruction word of over 500 bits and will do 10 to 30 RISC-level operations per cycle [Patterson 82]. ELI stands for Enormously Longword Instructions; 512 is the size of the instruction word we hope to achieve. (The current design has a 1200-bit instruction word.)

Once it became clear that we could actually compile code for a VLIW machine, some new questions appeared, and answers are presented in this paper. How do we put enough tests in each cycle without making the machine too big?
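As a rough illustration of the long-instruction-word idea (the register names, opcodes, and three-slot bundle below are invented for exposition, not the ELI-512's actual format), a single long instruction can be modeled as a set of independent operation slots that all read their operands and execute in the same cycle:

```python
import operator

# A minimal sketch: a "long instruction" is a bundle of independent
# operation slots, all issued in the same cycle by one central control
# unit. Register names and opcodes are invented for illustration.

def execute_long_instruction(regs, slots):
    """Read every operand first, then write every result, so the
    slots behave as if they executed simultaneously."""
    results = [(dst, op(regs[a], regs[b])) for dst, op, a, b in slots]
    for dst, value in results:
        regs[dst] = value
    return regs

regs = {"r1": 3, "r2": 4, "r3": 10, "r4": 2, "r5": 0, "r6": 0}
# One long instruction word: an add, a multiply, and a subtract
# execute in parallel within a single instruction stream.
bundle = [("r5", operator.add, "r1", "r2"),
          ("r6", operator.mul, "r3", "r4"),
          ("r1", operator.sub, "r3", "r2")]
execute_long_instruction(regs, bundle)
print(regs["r5"], regs["r6"], regs["r1"])  # 7 20 6
```

Note that the slot writing r1 still sees r1's old value in the other slots' reads; it is this read-all-then-write-all semantics that lets the operations in one instruction be statically scheduled as truly simultaneous.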
How do we put enough memory references in each cycle without making the machine too slow?

Everyone wants to use cheap hardware in parallel to speed up computation. One obvious approach would be to take your favorite Reduced Instruction Set Computer and let it be capable of executing 10 to 30 RISC-level operations per cycle, controlled by a very long instruction word. (In fact, call it a VLIW.) A VLIW looks like very parallel horizontal microcode. More formally, VLIW architectures have the following properties:

- There is one central control unit issuing a single long instruction per cycle.
- Each long instruction consists of many tightly coupled independent operations.
- Each operation requires a small, statically predictable number of cycles to execute.
- Operations can be pipelined.

These properties distinguish VLIWs from multiprocessors (with large asynchronous tasks) and dataflow machines (without a single flow of control, and without the tight coupling). VLIWs have none of the required regularity of a vector processor or true array processor.

Many machines approximately like this have been built, but they have all hit a very low ceiling in the degree of parallelism they provide. Besides horizontal microcode engines, these machines include the CDC 6600 and its many successors, such as the scalar portion of the CRAY-1; the IBM Stretch and 360/91; and the Stanford MIPS [Hennessy 82]. It's not surprising that they didn't offer very much parallelism. Experiments and experience indicated that only a factor of 2 to 3 speedup from parallelism was available within basic blocks. (A basic block of code has no jumps in except at the beginning and no jumps out except at the end.) No one knew how to find parallelism beyond conditional jumps, and evidently no one was even looking. It seemed obvious that you couldn't put operations from different basic blocks into the same instruction. There was no way to tell beforehand about the flow of control.

© 1983 ACM 0149-7111/83/0600/0140 $01.00
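The basic-block boundaries described above can be computed with the standard "leaders" rule; the three-field instruction encoding below is a hypothetical sketch, not any real compiler's representation:

```python
# A basic block has no jumps in except at the beginning and no jumps
# out except at the end. The (opcode, argument) encoding below is
# invented purely for illustration.

def basic_block_leaders(instrs):
    """Return the indices that start a basic block: the first
    instruction, every branch target, and every instruction that
    follows a branch (the fall-through path)."""
    leaders = {0}
    for i, (op, arg) in enumerate(instrs):
        if op in ("jump", "branch"):
            leaders.add(arg)           # the branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)     # so does the fall-through
    return sorted(leaders)

prog = [("load", 0), ("add", 1), ("branch", 4),   # block starting at 0
        ("mul", 2),                               # block starting at 3
        ("store", 3), ("jump", 0)]                # block starting at 4
print(basic_block_leaders(prog))  # [0, 3, 4]
```

A scheduler confined to basic blocks may only reorder operations between consecutive leaders, which is why the parallelism available to such machines tops out at the factor of 2 to 3 cited above.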
How would you know whether you wanted them to be executed together?

Occasionally people have built much more parallel VLIW machines for special purposes. But these have been hand-coded. Hand-coding long-instruction-word machines is a horrible task, as anyone who's written horizontal microcode will tell you. The code arrangements are unintuitive and nearly impossible to follow. Special-purpose processors can get away with hand coding because they need only a very few lines of code. The Floating Point Systems AP-120b can offer speedup by a factor of 5 or 6 in a few special-purpose applications for which code has been handwritten at enormous cost. But this code does not generalize, and most users get only the standard 2 or 3 - and then only after great labor and on small programs.

We're talking about an order of magnitude more parallelism; obviously we can forget about hand coding. But where does the parallelism come from? Not from basic blocks. Experiments showed that the parallelism within basic blocks is very limited [Tjaden 70, Foster 72]. But a radically new global compaction technique called trace scheduling can find large degrees of parallelism beyond basic-block boundaries. Trace scheduling doesn't work on some code, but it will work on most general scientific code. And it works in a way that makes it possible to build a compiler that generates highly parallel code.

Experiments done with trace scheduling in mind verify the existence of huge amounts of parallelism beyond basic blocks [Nicolau 81]. [Nicolau 81] repeats an earlier experiment done in a different context that found the same parallelism but dismissed it; trace scheduling was then unknown and immense amounts of hardware would have been needed to take advantage of the parallelism [Riseman 72].

WHY NOT VECTOR MACHINES?

Vector machines seem to offer much more parallelism

