Stream Vector Processing Unit: Stream Processing Using SIMD on a General Purpose Processor
ELEC525 – Spring 2004
Michael Calhoun, Michael Chang, Manik Gadhiok, Marjan Karkooti

Hypothesis

Modern scalar processors use fetch bandwidth inefficiently when executing vectorizable code. By augmenting a general-purpose processor with a Stream Vector Processing Unit (SVPU), we can use the fetch bandwidth much more efficiently and achieve speed-ups on vectorizable code. We also believe that by using a hierarchy of register files, this architecture will use memory bandwidth more efficiently by exploiting locality in data streams.

Introduction

The world of general-purpose computing has experienced a transformation from text and 2-D processing to media processing. Multimedia workloads are becoming increasingly dominant in general-purpose computing [1], [2]. The performance of applications such as high-resolution games, video conferencing, signal processing, and image manipulation on a typical superscalar leaves a lot to be desired. Chipmakers like Intel, AMD, and Sun have rolled out media SIMD-style ISA extensions (MMX [3], 3DNow! [4], VIS [5]) to meet the high computational demands of these applications and to extract the data parallelism inherent in them. Fixed-function ASIC solutions also exist, but the ISA-extension approach provides easier programmability, better performance, and easier upgrades from one generation to the next. Certain other applications, such as scientific code, also exhibit data-level parallelism.

We propose to augment a general-purpose processor with a stream-vector processing unit (SVPU) to improve performance on vectorizable numerical and multimedia applications. By converting the data-parallel code segments of these programs into vector instructions, we believe that we can utilize the memory bandwidth much more efficiently. In the rest of the paper, we will also refer to these code segments (those with potential for conversion to vector instructions) as vectorizable code. Ideas for the SVPU architecture are taken from the Imagine stream processor [6]. The SVPU is simpler, more scalable, and more complexity-effective than existing architectures. The proposed architecture gives us the benefits of a vector unit together with those of a wide-issue superscalar processor.

Architecture

Our project modifies a superscalar general-purpose processor. The basic core of the GPP is left intact, and we augment it with a new functional unit, the Stream Vector Processing Unit. The SVPU works by executing special, instruction-set-defined instructions on streams of data loaded from a special compiler-controlled cache.

Figure 1: Block Diagram of Processor with an on-chip Stream Unit
Figure 2: Block Diagram of the Stream Unit (SRF/SVPU)

Core Architecture

The basic GPP is a superscalar processor capable of fetching and issuing multiple instructions per clock cycle. Instructions are fetched, decoded, and then sent for execution down one of the functional units (integer ALU, FP ALU, BRU, LSU). Stream instructions are fetched and decoded exactly the same way as scalar instructions and are then issued to the SVPU for processing (Figure 1). Each instruction, including stream instructions, occupies a slot in the reorder buffer to ensure that results commit in order. Stream instructions stall in the issue logic when they depend on a value from the scalar core (ST.SSR, ST.LSR). When a stream instruction finishes, it signals the ROB that it has completed and commits its value.
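To make the notion of vectorizable code concrete, the sketch below shows the kind of independent, element-wise loop the SVPU targets. It is an illustrative example of ours, not one of the paper's benchmarks; the function name and sizes are assumptions.

    /* Illustrative only: an element-wise multiply-add loop.  Each
     * iteration is independent, so the same operation can be applied
     * across a whole stream.  A scalar core re-fetches the loop body
     * every iteration; with streams of 8 64-bit words, one stream
     * instruction would cover 8 iterations of this body, cutting the
     * fetch bandwidth spent on it by roughly a factor of 8. */
    #include <stddef.h>

    void saxpy64(double *y, const double *x, double a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* same operation on every element */
    }

In the proposed design, the compiler would emit stream loads for x and y, a stream arithmetic instruction for the multiply-add, and a stream store, while the scalar core keeps only the loop-control work.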
However, since the result is not written to a global bus or stored anywhere else, SVPU instructions do not natively support precise exceptions. If an exception occurs and a stream instruction has already committed a value out of order that would cause a problem, the ROB will have an entry for the instruction and can trap to the OS for appropriate handling. The scalar register file can communicate with the SVPU, as certain instructions call for a read of GPR16 or a bulk register transfer from/to GPRs 8-15. The L1 instruction cache is read-only, and the L1 data cache is write-through, so the L2 cache contains all of the live copies and coherency information. The SVPU does not connect to the L1 or L2 caches of the superscalar core (only to the memory controller), so the L2 cache must snoop SVPU loads and stores to the memory controller and take appropriate coherency actions. The memory controller supports virtual channels and lazy precharge to optimize for row reuse and memory-access locality.

SVPU Architecture

The SVPU (Figure 2) is composed of the Stream Register File (SRF) and the Vector Processing Unit (VPU). The SRF is a compiler-controlled cache totaling 2 KB, composed of 32 stream registers ("streams"), each made of 8 64-bit type-agnostic data words. Attached to the SRF is the SVPU load/store unit (LSU), which interfaces between the SRF and either the scalar register file or memory. The LSU supports 8 outstanding memory requests; it issues requests in order but can receive serviced requests out of order. However, the LSU has to disambiguate memory addresses before allowing out-of-order commitment of results.

The VPU is the core of the SVPU and is responsible for executing arithmetic operations. The VPU has four local stream registers, each holding 8 64-bit values. Each local register is organized into four banks of 64-bit pairs, and the local registers are the only ones that can directly feed the ALUs of the VPU. The VPU has two fully pipelined 64-bit integer arithmetic units and two pipelined floating-point units. The VPU operates at the granularity of streams, supporting three-operand-format instructions that perform the same operation 8 or 16 times on data from one or more streams.

The SRF and VPU are fed instructions by the SVPU issue queue/scheduler. Instructions enter the SVPU at the issue queue and are checked for dependencies using a basic scoreboard algorithm. When all dependencies are satisfied, an instruction can issue to the SRF LSU, the VPU local register file, or the VPU core for execution. Instructions that use different resources, e.g., an ALU operation and a memory operation, can execute in parallel. The SRF and LSU are connected by a 128-bit/cycle unidirectional bus, the SRF and VPU by a 256-bit/cycle unidirectional bus, and the scalar register file is connected to the LSU by a 64-bit unidirectional bus.

Instruction
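The following is a minimal software sketch of the storage sizes and the three-operand stream granularity described above; names such as stream_t, srf, and stream_add are ours for illustration and are not the paper's ISA or microarchitecture.

    /* Sketch (illustrative only) of the SRF sizing and a three-operand
     * stream operation.  32 streams x 8 x 64-bit words = 2 KB, matching
     * the SRF described in the text. */
    #include <stdint.h>

    #define WORDS_PER_STREAM 8    /* each stream: 8 x 64-bit words       */
    #define NUM_STREAMS     32    /* SRF: 32 streams, 2 KB total         */

    typedef struct { uint64_t w[WORDS_PER_STREAM]; } stream_t;

    static stream_t srf[NUM_STREAMS];  /* compiler-controlled stream register file */

    /* One three-operand stream instruction: the same operation is applied
     * to all 8 words of the source streams and written to the destination
     * stream. */
    static void stream_add(stream_t *dst, const stream_t *a, const stream_t *b)
    {
        for (int i = 0; i < WORDS_PER_STREAM; i++)
            dst->w[i] = a->w[i] + b->w[i];
    }

In the hardware, the eight element operations would be spread across the two integer ALUs (or the two FP units) working from the VPU's banked local stream registers rather than iterated sequentially; the loop above only models the result, not the timing.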

