UW-Madison ECE 734 - Mapping DSP Algorithms to General-Purpose Out-of-order Processors - D1687481

Home> Schools> University of Wisconsin, Madison> Electrical and Computer Engr (ECE) > ECE 734> Mapping DSP Algorithms to General-Purpose Out-of-order Processors

DOC PREVIEW

UW-Madison ECE 734 - Mapping DSP Algorithms to General-Purpose Out-of-order Processors

School name University of Wisconsin, Madison

Course Ece 734- VLSI Array Structures for Digital Signal Processing

Pages 17

This preview shows page 1-2-3-4-5-6 out of 17 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 17 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

May 18, 2002. ECE734 Project ReportMapping DSP Algorithms to General-Purpose Out-of-orderProcessorsSpring 2002 ECE734 Project ReportIlhyun Kim and Donghyun [email protected], [email protected] 1 of 16May 18, 2002. ECE 734 Project ReportMapping DSP Algorithms to General-Purpose Out-of-orderProcessors1 .0 IntroductionIt is becoming popular to implement DSP algorithms on a general purpose processor. This is not onlybecause many applications often put more emphasis on the cost for developing and maintaining DSP software ratherthan the hardware cost that may be lowered by mass production, but also because a commodity-part general-purposemicroprocessor can easily achieve the performance required for the specific application that might be implementedonly by a special-purpose DSP processor. Considering this trend, it becomes more important to find a efficient way tomap DSP algorithms onto general-purpose microprocessors.Many existing DSP transformations try to extract the most parallelism from the algorithm to achieve highperformance by mapping multiple independent computations to different processing elements. Similarly, this is also agoal of high-performance compilers that exploit ILP (instruction level parallelism) to hide the latency of operationsby enabling overlap of instruction executions. At the same time, more and more general-purpose microprocessors arebuilt with out-of-order execution cores that dynamically re-order instructions so that they keep all available hardwareresources running by selecting independent operations within a limited scope over program binary (i.e. instructionwindow).Since these efforts to extract parallelism occur on different layers of high-level algorithm transformations,compiled binary and processor core independently without collaborations, some of efforts may be wasteful and evensome of them would negatively affect performance due to the lack of understanding on what actually happens in themicroprocessor. In this project, we study the effect of algorithm transformations that may be performed in the high-level program and assembly language levels for DSP applications running on a general purpose out-of-order proces-sor considering several real-world constraint. Specifically, we study the effect of single assignment, unfolding trans-formations performed on Alpha architecture [2] since they are closely related to optimizing loops that are common inmatrix manipulations. Additionally, we are focusing on identifying the source of constraint in loop operations by ana-lyzing data dependence graph.Based on the findings of efficient algorithm mapping techniques, we present a case study on optimizingMPEG-2 decoder [5] from Mediabench suite. Our optimization shows that the modified decoder significantly outper-forms the base code by reducing the execution time by 38%.The rest of the report is organized as follows: in Section 2, we present the overview on the out-of-order exe-cution core of general-purpose processors. Section 3 describes our simulation methodology and a brief explanationson the Alpha architecture that we modeled. In Section 4, the effect of algorithm transformations running on a general-purpose processor is studied. Finally, Section 5 presents a case study on optimizing a MPEG-2 decoder running on anPage 2 of 16May 18, 2002. ECE 734 Project ReportAlpha architecture implementation.2 .0 Out-of-order Execution OverviewMany current-generation general purpose processors are built with out-of-order execution core (OoO core)in which the hardware dynamically detects independent instructions with both inputs available and assigns functionalunits to those instructions [4]. An illustration of the structure is shows in Figure 1a. High-level language is compiledby a compiler that translates the high-level code written by the user into binaries, a machine language primitives thatactually realize program semantics. Dynamic pipeline scheduling goes past stalls to find later instructions to executewhile waiting for the stall to be resolved. Typically, the pipeline is divided into three major units: an instruction fetchand issue unit, execute unit, and a commit unit. The first unit fetches instructions, decodes them, and sends eachinstructions to corresponding functional unit of the execute stage. There might be 5 to 10 functional units. Each func-tional unit has buffers, called reservation stations, that hold the operands and the operation. As soon as the buffer con-tains all its operands and the functional unit is ready to execute, the result is calculated. It is then up to the commitunit to decide when it is safe to put the result into the register file or into memory. The basic model is multiple inde-pendent state machines performing instruction execution: one unit fetching and decoding instructions, several func-tional units performing the operations, and one unit deciding when instructions are complete so that the results can becommitted. To make programs behave as if they were running on a simple non-pipelined processor, the instructionfetch and decode unit is required to issue instructions in order, and the commit unit is required to write results to reg-isters and memory in program execution order.Figure 1b shows the functional units are mapped to the program’s data dependence graph, considering theresource constraints. Since the hardware keeps track of data dependences of the instructions, functional units areautomatically assigned to possible operations on equi-temporal plane in the data dependence graph if the machine(a) (b)Figure 1. Out-of-order execution overview. (a) is a simple illustration of the out-of-order execution core with4 functional units that perform different independent operations in parallel. (b) shows the functional unit map-ping to the program’s data dependence graph, given resource constraints.for(i=1~m)for(j=1~n)c(i,j)=c(i,j-1)+c(i-1,j)functionalunitsPage 3 of 16May 18, 2002. ECE 734 Project Reportdoes not have any resource constraint. However, there are two elements that restrict this functional unit mappings:one is the finite number of functional units and the other is the finite scope over the instructions, which is calledinstruction window that detects instructions with all data dependences satisfied by the execution of parent instruc-tions. Typically, the instruction window size is 32 to 128 instructions. Even if all data dependences are resolved bythe previous executions, the operations should be serialized

View Full Document