MIT 6 893 - Research Paper

Vectorizing SPECint95

Krste Asanović
Computer Science Division
University of California at Berkeley
Berkeley, [email protected]

The SPECint95 benchmark suite contains compute-intensive integer codes that are generally regarded as non-vectorizable. This study reveals, however, that half of the benchmarks can be accelerated significantly with vector execution, though this requires minor modifications to the source code in some cases. For the T0 vector microprocessor, a geometric mean improvement of 1.32 is obtained across all eight benchmarks, with individual speedups in the range 1.16–4.5. Profiling information reveals that the vector unit provides large speedups but only on limited portions of the runtime, whereas superscalar processors provide more modest speedups across the entire runtime. This result implies that vectors combined with superscalar instruction issue will yield much larger speedups than if either technique is used in isolation. For example, the vector unit from T0 would only add about 7% to the area of the R10000 superscalar microprocessor but would increase its SPECint95 performance by around 28%. These results demonstrate that a vector unit can improve cost/performance even for codes with low levels of vectorization.

1 Introduction

Increases in the level of scalar instruction-level parallelism (ILP) supported by microprocessor microarchitectures have been accompanied by dramatic increases in the die area of these scalar units, primarily because of the complexity of managing multiple scalar instructions at various stages of execution. Vector instruction set extensions provide a much simpler hardware mechanism for supporting high degrees of parallelism, provided computations can be cast into a data parallel form. Vectors have proven successful at improving the performance of supercomputers running scientific and engineering applications, but there has been little research applying vectors to other application areas.
The need for greater performance on multimedia applications has renewed interest in such data parallel instruction sets, with several manufacturers introducing data parallel extensions to their instruction sets [PW96, TONH96, Lee96] and several research groups investigating vector microprocessor designs [WAK+96, KPP+97, LD97, Esp97]. It is therefore plausible that future microprocessors will include small but powerful vector units to accelerate both new multimedia and human-machine interface applications as well as traditional scientific and engineering tasks. Currently, only supercomputers have vector units and so few non-supercomputer applications are written with vectorization in mind. If vector units become commonplace on low-cost systems and provide the fastest mode of execution, compiler writers and programmers will have an incentive to vectorize a wider range of tasks.

The SPEC95 benchmarks [Corb] have become popular both to measure performance of commercial processors and to provide the workload for compiler and architecture research studies. The SPEC95 benchmarks are divided into SPECfp95, which contains floating-point codes, and SPECint95, which contains compute-intensive integer codes. While SPECfp95 contains many programs originally developed for vector supercomputers and which are known to be highly vectorizable, the SPECint95 benchmarks are generally regarded as non-vectorizable.
This paper reports on a study which reveals that half of the SPECint95 codes, including some described as “non-vectorizable” by the SPEC documentation, can be accelerated significantly using vector execution, though this occasionally requires some minor modifications to the source code.

2 Methodology

From Amdahl’s law [Amd67], we know that vector speedup is given by

$$\frac{T_s}{T_v} = \frac{1}{(1 - f) + f/V}$$

where $T_s$ is the execution time of the program running in scalar mode, $T_v$ is the execution time for the program running in vector mode, $f$ is the fraction enhanced by vector execution (the vector coverage), and $V$ is the vector speedup for that fraction. More generally, programs have several different portions of their runtime that can be accelerated to differing degrees, which we can express by rewriting Amdahl’s law as

$$\frac{T_s}{T_v} = \frac{1}{\left(1 - \sum_i f_i\right) + \sum_i (f_i / V_i)}$$

where $V_i$ is the vector speedup for fraction $f_i$ of the runtime.

The approach taken in this study has two main steps. The first step is to profile the SPECint95 codes running on conventional superscalar microprocessor-based systems to identify routines that are both time-consuming and vectorizable. Two workstations were used for the profiling measurements, a Sun Ultra-1/170 workstation running Solaris 2.5.1 and an Intel Pentium-II workstation running NetBSD 1.3. The Sun machine has a 167 MHz UltraSPARC-I processor [CDd+95], with in-order issue of up to 4 instructions per cycle. The Sun C compiler version 4.0 was used to compile the code using the flags: -fast -xO4 -xarch=v8plus -dn. Execution time profiles were obtained using either the -p compiler option with prof, or with high-resolution timer calls (Solaris gethrtimer) embedded in the application. The Pentium-II processor [Gwe97] runs at 300 MHz with out-of-order execution of up to 3 CISC instructions per cycle.
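The two forms of Amdahl’s law given in the Methodology section can be evaluated numerically with a short sketch like the one below. The function name `amdahl_speedup` and all numeric inputs are illustrative assumptions and are not taken from the paper's measurements; the single-fraction form is simply the special case of a one-element list.

```python
def amdahl_speedup(fractions, speedups):
    """Overall speedup Ts/Tv given runtime fractions f_i and their speedups V_i.

    Implements 1 / ((1 - sum_i f_i) + sum_i (f_i / V_i)).
    The name and interface are hypothetical, chosen for illustration only.
    """
    if len(fractions) != len(speedups):
        raise ValueError("need exactly one speedup per fraction")
    if any(f < 0 for f in fractions) or any(v <= 0 for v in speedups):
        raise ValueError("fractions must be >= 0 and speedups must be > 0")
    covered = sum(fractions)
    if covered > 1.0 + 1e-12:
        raise ValueError("fractions must not sum to more than 1")
    # Portion of the runtime that still executes at scalar speed.
    unaccelerated = 1.0 - covered
    # Each accelerated portion shrinks by its own speedup factor.
    accelerated = sum(f / v for f, v in zip(fractions, speedups))
    return 1.0 / (unaccelerated + accelerated)


if __name__ == "__main__":
    # Hypothetical inputs: one region covering half the runtime, sped up 4x.
    print(amdahl_speedup([0.5], [4.0]))
    # Hypothetical inputs: two regions accelerated to differing degrees.
    print(amdahl_speedup([0.3, 0.2], [8.0, 2.0]))
```

With these made-up inputs, even a very large speedup on a region covering half the runtime can never push the overall speedup past 2, which is why the study measures both the coverage and the per-region speedup rather than either one alone.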
For the Pentium-II system, the gcc version 2.7.2.2 compiler was used to compile the code with optimization flags: -O3 -m486. Execution time profiles were obtained using either the -pg compiler option with gprof, or with high-resolution timer calls (NetBSD gettimeofday).

The second step is to port the codes to run on the T0 vector microprocessor [WAK+96] described in the next section. The SPECint95 codes were not written with vectorization in mind, and in some cases have been highly tuned for scalar execution. This artificially limits the potential for vectorization, and so minor changes were allowed within source code modules provided there were no changes to global data structures or program behavior. The intent is to model how these types of code would be written and tuned for future microprocessors with vector units, rather than to determine the effectiveness of automatic vectorization of the unmodified SPECint95 sources. The scalar code was also modified and optimized to allow a fairer comparison.

Execution is profiled both before and after vectorization on T0. These timings give values for the vector speedup, $V_i$, as well as the fraction of runtime, $f_i$, for each piece of vectorizable code. The workstation systems are also profiled to give values for $f_i$.

3 The T0 Vector Microprocessor

T0 is a research prototype containing a

