MIT 6 893 - Research Paper - D1365720

School name Massachusetts Institute of Technology

Course 6 893-

Pages 22

This preview shows page 1-2-21-22 out of 22 pages.


Vectorizing SPECint95

Krste Asanović
Computer Science Division
University of California at Berkeley
Berkeley, [email protected]

The SPECint95 benchmark suite contains compute-intensive integer codes that are generally regarded as non-vectorizable. This study reveals, however, that half of the benchmarks can be accelerated significantly with vector execution, though this requires minor modifications to the source code in some cases. For the T0 vector microprocessor, a geometric mean improvement of 1.32 is obtained across all eight benchmarks, with individual speedups in the range 1.16–4.5. Profiling information reveals that the vector unit provides large speedups but only on limited portions of the runtime, whereas superscalar processors provide more modest speedups across the entire runtime. This result implies that vectors combined with superscalar instruction issue will yield much larger speedups than if either technique is used in isolation. For example, the vector unit from T0 would only add about 7% to the area of the R10000 superscalar microprocessor but would increase its SPECint95 performance by around 28%. These results demonstrate that a vector unit can improve cost/performance even for codes with low levels of vectorization.

1 Introduction

Increases in the level of scalar instruction-level parallelism (ILP) supported by microprocessor microarchitectures have been accompanied by dramatic increases in the die area of these scalar units, primarily because of the complexity of managing multiple scalar instructions at various stages of execution. Vector instruction set extensions provide a much simpler hardware mechanism for supporting high degrees of parallelism, provided computations can be cast into a data parallel form. Vectors have proven successful at improving the performance of supercomputers running scientific and engineering applications, but there has been little research applying vectors to other application areas.
The need for greater performance on multimedia applications has renewed interest in such data parallel instruction sets, with several manufacturers introducing data parallel extensions to their instruction sets [PW96, TONH96, Lee96] and several research groups investigating vector microprocessor designs [WAK+96, KPP+97, LD97, Esp97]. It is therefore plausible that future microprocessors will include small but powerful vector units to accelerate both new multimedia and human-machine interface applications as well as traditional scientific and engineering tasks. Currently, only supercomputers have vector units and so few non-supercomputer applications are written with vectorization in mind. If vector units become commonplace on low-cost systems and provide the fastest mode of execution, compiler writers and programmers will have an incentive to vectorize a wider range of tasks.

The SPEC95 benchmarks [Corb] have become popular both to measure performance of commercial processors and to provide the workload for compiler and architecture research studies. The SPEC95 benchmarks are divided into SPECfp95, which contains floating-point codes, and SPECint95, which contains compute-intensive integer codes. While SPECfp95 contains many programs originally developed for vector supercomputers and which are known to be highly vectorizable, the SPECint95 benchmarks are generally regarded as non-vectorizable.
This paper reports on a study which reveals that half of the SPECint95 codes, including some described as "non-vectorizable" by the SPEC documentation, can be accelerated significantly using vector execution, though this occasionally requires some minor modifications to the source code.

2 Methodology

From Amdahl's law [Amd67], we know that vector speedup is given by

$$\frac{T_s}{T_v} = \frac{1}{(1 - f) + f/V}$$

where $T_s$ is the execution time of the program running in scalar mode, $T_v$ is the execution time for the program running in vector mode, $f$ is the fraction enhanced by vector execution (the vector coverage), and $V$ is the vector speedup for that fraction. More generally, programs have several different portions of their runtime that can be accelerated to differing degrees, which we can express by rewriting Amdahl's law as

$$\frac{T_s}{T_v} = \frac{1}{\left(1 - \sum_i f_i\right) + \sum_i \left(f_i / V_i\right)}$$

where $V_i$ is the vector speedup for fraction $f_i$ of the runtime.

The approach taken in this study has two main steps. The first step is to profile the SPECint95 codes running on conventional superscalar microprocessor-based systems to identify routines that are both time-consuming and vectorizable. Two workstations were used for the profiling measurements, a Sun Ultra-1/170 workstation running Solaris 2.5.1 and an Intel Pentium-II workstation running NetBSD 1.3. The Sun machine has a 167 MHz UltraSPARC-I processor [CDd+95], with in-order issue of up to 4 instructions per cycle. The Sun C compiler version 4.0 was used to compile the code using the flags: -fast -xO4 -xarch=v8plus -dn. Execution time profiles were obtained using either the -p compiler option with prof, or with high-resolution timer calls (Solaris gethrtime) embedded in the application. The Pentium-II processor [Gwe97] runs at 300 MHz with out-of-order execution of up to 3 CISC instructions per cycle.
The gcc version 2.7.2.2 compiler was used to compile the code with optimization flags: -O3 -m486. Execution time profiles were obtained using either the -pg compiler option with gprof, or with high-resolution timer calls (NetBSD gettimeofday).

The second step is to port the codes to run on the T0 vector microprocessor [WAK+96] described in the next section. The SPECint95 codes were not written with vectorization in mind, and in some cases, have been highly tuned for scalar execution. This artificially limits the potential for vectorization, and so minor changes were allowed within source code modules provided there were no changes to global data structures or program behavior. The intent is to model how these types of code would be written and tuned for future microprocessors with vector units, rather than to determine the effectiveness of automatic vectorization of the unmodified SPECint95 sources. The scalar code was also modified and optimized to allow a fairer comparison.

Execution is profiled both before and after vectorization on T0. These timings give values for the vector speedup, $V_i$, as well as the fraction of runtime, $f_i$, for each piece of vectorizable code. The workstation systems are also profiled to give values for $f_i$.

3 The T0 Vector Microprocessor

T0 is a research prototype containing a
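
As a rough illustration of the Amdahl's law model from Section 2, the following Python sketch evaluates both the single-fraction form and the generalized form with several accelerated portions. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not measurements from the study.

```python
def amdahl_speedup(f, v):
    """Overall speedup T_s / T_v when a fraction f of the scalar
    runtime is accelerated by a factor v (single-fraction form)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if v <= 0.0:
        raise ValueError("v must be positive")
    return 1.0 / ((1.0 - f) + f / v)


def amdahl_speedup_multi(portions):
    """Generalized form: portions is an iterable of (f_i, V_i) pairs,
    where f_i is a fraction of scalar runtime and V_i its speedup."""
    portions = list(portions)
    covered = sum(f for f, _ in portions)
    if covered > 1.0 + 1e-12:
        raise ValueError("fractions f_i must sum to at most 1")
    if any(v <= 0.0 for _, v in portions):
        raise ValueError("every V_i must be positive")
    # The unaccelerated remainder runs at scalar speed; each portion
    # shrinks by its own factor V_i.
    return 1.0 / ((1.0 - covered) + sum(f / v for f, v in portions))


# Hypothetical inputs: half the runtime sped up 4x.
single = amdahl_speedup(0.5, 4.0)

# Hypothetical inputs: two routines with different coverage and speedup.
combined = amdahl_speedup_multi([(0.25, 5.0), (0.25, 2.0)])
```

With a single (f, V) pair the generalized function reduces to the single-fraction one, which mirrors how the second equation in Section 2 extends the first. The sketch also makes the abstract's observation concrete: a large V applied to a small f yields only a modest overall speedup.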
