Investigating The Utility of MMX/SSE Instruction Sets Now And In The Future

CS 15-740 Computer Architecture Final Project Report
Computer Science Department, Carnegie Mellon University

Jernej Barbic, Brian Potetz, Matt Rosencrantz

Abstract

In this report we examine several multimedia applications with and without MMX/SSE enhancements, and examine the impact of these enhancements on execution time and cache performance. We implement several versions of the programs to isolate their memory and processing requirements. One criticism of SIMD technology is that it may be doomed to obsolescence as processors gain speed with respect to memory. We discover that the multimedia applications we looked at are not memory bound. Enhancing applications with MMX does make them more memory bound, but not so much as to nullify the gain given by the enhancement. We show that prefetching instructions can be used to hide memory latency, and that MMX-style enhancement will still be useful as long as the latency is predictable, the memory bandwidth scales sufficiently, and the total runtime of the program is large compared to the latency of memory.

1 Introduction and Previous Work

Our first goal is to test the ability of MMX and SSE to improve the performance of multimedia applications. In this paper we compare enhanced and unenhanced versions of a wide variety of applications representative of important multimedia workloads, both today and in the future.

There has been some work in this area, but there are still many unexplored avenues. In particular, Bhargava et al. investigated performance enhancements gained by MMX instructions on a suite of multimedia applications [2], but relied on Intel's MMX-enhanced image processing libraries to gain performance. This introduces waste due to operand arrangement and function call overhead. Worse still, the libraries implement generalized algorithms that cannot take advantage of simplifications allowed when targeting a specific application. For example, JPEG only requires 8x8 inverse discrete cosine transform (IDCT) operations, and a general implementation of IDCT would not be able to take advantage of that fact. This may have been why they observed a negative performance gain in their JPEG implementation. Additionally, SSE and prefetching were not explored.

We are also interested in investigating the continued utility of these multimedia instruction sets. One criticism of SIMD instructions has been that processor speeds have been increasing faster than memory access times, and that multimedia workloads are known to process large quantities of data. It would therefore seem possible that the utility of SIMD may be decreasing as multimedia applications become more memory bound, and that eventually it may no longer be worth supporting. In this paper we use a variety of methods to test the veracity of this claim.

Slingerland et al. conducted a detailed analysis of the cache performance of multimedia applications and concluded that they actually exhibit lower instruction miss ratios and comparable data miss ratios when contrasted with other widely studied workloads [4]. In [6], the authors studied the effect of using Sun's VIS instruction set. After adding VIS instructions, some applications that had been compute bound became memory bound. After inserting software prefetching, however, the applications became processor bound again. One shortcoming of this experiment was that it relied on the RSIM superscalar processor simulator. We seek to establish similar results on a real-world processor that is in common use, instead of using a simulator for a non-existent CPU.

2 Methodology

All our results were collected on Pentium III 1GHz computers running Windows 2000, with 512MB of RAM, 512K L2 cache, and 16K instruction and data L1 caches. The cache line size is 32 bytes for both L2 and L1, and all caches are write-back [1].

2.1 No Memory Mode

The first quantity we wish to measure is the amount of time wasted by programs due to cache misses. This statistic is somewhat difficult to obtain, however, because it is difficult to judge when an out-of-order superscalar processor is wasting time, and more difficult still to give a precise reason for the waste when it occurs. A crucial observation, however, is that a computer with an infinite pre-loaded cache would give us the performance of the program without any waste due to bad cache performance, and we would then be able to compute the wasted time by a simple subtraction. Unfortunately, it is not feasible to build such a device, and no simulator was available that we could alter to achieve this behavior. The same functionality can be simulated, though, by simply altering all loads and stores in the program to always access a single small static buffer instead of traversing the enormous sea of data usually processed by the algorithm.

In general, this method of measurement is not practical. In particular, it cannot be performed if the control flow of a program depends on the data you are trying to isolate it from. Fortunately, there is a large class of multimedia applications whose control flow is independent of the data: decoding applications. Decoding applications often do not make decisions based on their data; they simply churn through it, passing data from input through a series of steps to the output. The procedure applied is independent of the data, which means that it will take as long to process a simple repeating pattern of bytes as it would to process actual image data.

A final implementation detail that can arise when using this strategy is that artificial instruction dependencies can be introduced by altering the loads and stores, making the code run artificially slowly. To avoid this problem, separate buffers can be used for reading and for writing, minimizing false dependencies and forcing all dependencies to be either read-after-read or write-after-write, which are more palatable than other dependency types in practice.

We implemented this modification on several applications of our multimedia suite and successfully used it to isolate the time wasted by bad cache performance from the time required to perform the processing alone. We call these modified versions no-memory mode versions. We made use of another simplifying idea: you need not implement no-memory mode everywhere, but only in hot spots that exhibit a large number of cache misses. In this way a large percentage of the cache misses in a program can be eliminated with relative ease.

2.2 No Processor Mode

Using the no-memory mode versions
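
As an illustration of the no-memory mode idea from Section 2.1, the following is a minimal C sketch, not the authors' code. It assumes a hypothetical data-independent kernel (a saturating brightness adjustment); the names `brighten`, `brighten_no_memory`, `measure`, and the buffer size `SMALL_BUF` are all invented for this example. The no-memory version performs the same arithmetic and the same number of loads and stores, but redirects them to two small static buffers, one for reading and one for writing, so that nearly every access should hit in the L1 cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical size: a power of two, small enough to stay resident in L1. */
#define SMALL_BUF 256

/* Separate read and write buffers avoid introducing false
   read-after-write dependencies between loop iterations. */
static uint8_t small_src[SMALL_BUF];
static uint8_t small_dst[SMALL_BUF];

static uint8_t saturating_add(uint8_t v, uint8_t delta) {
    unsigned sum = (unsigned)v + delta;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Normal mode: streams through the full input and output arrays. */
void brighten(const uint8_t *src, uint8_t *dst, size_t n, uint8_t delta) {
    for (size_t i = 0; i < n; i++)
        dst[i] = saturating_add(src[i], delta);
}

/* No-memory mode: same work per element, but every load and store
   is redirected into the small static buffers. */
void brighten_no_memory(size_t n, uint8_t delta) {
    for (size_t i = 0; i < n; i++) {
        size_t j = i & (SMALL_BUF - 1);
        small_dst[j] = saturating_add(small_src[j], delta);
    }
}

/* Times both versions on n bytes. The difference
   (*normal_s - *no_memory_s) is a rough estimate of the time lost
   to cache misses. Returns 0 on success, -1 if allocation fails. */
int measure(size_t n, uint8_t delta, double *normal_s, double *no_memory_s) {
    assert(normal_s != NULL && no_memory_s != NULL);
    uint8_t *src = (uint8_t *)malloc(n);
    uint8_t *dst = (uint8_t *)malloc(n);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        src[i] = (uint8_t)i;

    volatile uint8_t sink;  /* keeps the compiler from discarding the loops */
    clock_t t0 = clock();
    brighten(src, dst, n, delta);
    sink = dst[n ? n - 1 : 0];
    clock_t t1 = clock();
    brighten_no_memory(n, delta);
    sink = small_dst[0];
    clock_t t2 = clock();
    (void)sink;

    *normal_s = (double)(t1 - t0) / CLOCKS_PER_SEC;
    *no_memory_s = (double)(t2 - t1) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    return 0;
}
```

Calling `measure` with a buffer much larger than the L2 cache and subtracting the two times gives the "simple subtraction" described above. Note that the index masking adds a small amount of work to the no-memory loop, and `clock()` has coarse resolution, so a careful measurement would repeat the kernel many times and use a finer timer.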

