DOC PREVIEW
UW-Madison ECE 734 - Mapping DSP Algorithms to General-Purpose Out-of-order Processors

This preview shows page 1-2-3-4-5-6 out of 17 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

May 18, 2002. ECE734 Project ReportMapping DSP Algorithms to General-Purpose Out-of-orderProcessorsSpring 2002 ECE734 Project ReportIlhyun Kim and Donghyun [email protected], [email protected] 1 of 16May 18, 2002. ECE 734 Project ReportMapping DSP Algorithms to General-Purpose Out-of-orderProcessors1 .0 IntroductionIt is becoming popular to implement DSP algorithms on a general purpose processor. This is not onlybecause many applications often put more emphasis on the cost for developing and maintaining DSP software ratherthan the hardware cost that may be lowered by mass production, but also because a commodity-part general-purposemicroprocessor can easily achieve the performance required for the specific application that might be implementedonly by a special-purpose DSP processor. Considering this trend, it becomes more important to find a efficient way tomap DSP algorithms onto general-purpose microprocessors.Many existing DSP transformations try to extract the most parallelism from the algorithm to achieve highperformance by mapping multiple independent computations to different processing elements. Similarly, this is also agoal of high-performance compilers that exploit ILP (instruction level parallelism) to hide the latency of operationsby enabling overlap of instruction executions. At the same time, more and more general-purpose microprocessors arebuilt with out-of-order execution cores that dynamically re-order instructions so that they keep all available hardwareresources running by selecting independent operations within a limited scope over program binary (i.e. instructionwindow).Since these efforts to extract parallelism occur on different layers of high-level algorithm transformations,compiled binary and processor core independently without collaborations, some of efforts may be wasteful and evensome of them would negatively affect performance due to the lack of understanding on what actually happens in themicroprocessor. In this project, we study the effect of algorithm transformations that may be performed in the high-level program and assembly language levels for DSP applications running on a general purpose out-of-order proces-sor considering several real-world constraint. Specifically, we study the effect of single assignment, unfolding trans-formations performed on Alpha architecture [2] since they are closely related to optimizing loops that are common inmatrix manipulations. Additionally, we are focusing on identifying the source of constraint in loop operations by ana-lyzing data dependence graph.Based on the findings of efficient algorithm mapping techniques, we present a case study on optimizingMPEG-2 decoder [5] from Mediabench suite. Our optimization shows that the modified decoder significantly outper-forms the base code by reducing the execution time by 38%.The rest of the report is organized as follows: in Section 2, we present the overview on the out-of-order exe-cution core of general-purpose processors. Section 3 describes our simulation methodology and a brief explanationson the Alpha architecture that we modeled. In Section 4, the effect of algorithm transformations running on a general-purpose processor is studied. Finally, Section 5 presents a case study on optimizing a MPEG-2 decoder running on anPage 2 of 16May 18, 2002. ECE 734 Project ReportAlpha architecture implementation.2 .0 Out-of-order Execution OverviewMany current-generation general purpose processors are built with out-of-order execution core (OoO core)in which the hardware dynamically detects independent instructions with both inputs available and assigns functionalunits to those instructions [4]. An illustration of the structure is shows in Figure 1a. High-level language is compiledby a compiler that translates the high-level code written by the user into binaries, a machine language primitives thatactually realize program semantics. Dynamic pipeline scheduling goes past stalls to find later instructions to executewhile waiting for the stall to be resolved. Typically, the pipeline is divided into three major units: an instruction fetchand issue unit, execute unit, and a commit unit. The first unit fetches instructions, decodes them, and sends eachinstructions to corresponding functional unit of the execute stage. There might be 5 to 10 functional units. Each func-tional unit has buffers, called reservation stations, that hold the operands and the operation. As soon as the buffer con-tains all its operands and the functional unit is ready to execute, the result is calculated. It is then up to the commitunit to decide when it is safe to put the result into the register file or into memory. The basic model is multiple inde-pendent state machines performing instruction execution: one unit fetching and decoding instructions, several func-tional units performing the operations, and one unit deciding when instructions are complete so that the results can becommitted. To make programs behave as if they were running on a simple non-pipelined processor, the instructionfetch and decode unit is required to issue instructions in order, and the commit unit is required to write results to reg-isters and memory in program execution order.Figure 1b shows the functional units are mapped to the program’s data dependence graph, considering theresource constraints. Since the hardware keeps track of data dependences of the instructions, functional units areautomatically assigned to possible operations on equi-temporal plane in the data dependence graph if the machine(a) (b)Figure 1. Out-of-order execution overview. (a) is a simple illustration of the out-of-order execution core with4 functional units that perform different independent operations in parallel. (b) shows the functional unit map-ping to the program’s data dependence graph, given resource constraints.for(i=1~m)for(j=1~n)c(i,j)=c(i,j-1)+c(i-1,j)functionalunitsPage 3 of 16May 18, 2002. ECE 734 Project Reportdoes not have any resource constraint. However, there are two elements that restrict this functional unit mappings:one is the finite number of functional units and the other is the finite scope over the instructions, which is calledinstruction window that detects instructions with all data dependences satisfied by the execution of parent instruc-tions. Typically, the instruction window size is 32 to 128 instructions. Even if all data dependences are resolved bythe previous executions, the operations should be serialized


View Full Document

UW-Madison ECE 734 - Mapping DSP Algorithms to General-Purpose Out-of-order Processors

Documents in this Course
Load more
Download Mapping DSP Algorithms to General-Purpose Out-of-order Processors
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Mapping DSP Algorithms to General-Purpose Out-of-order Processors and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Mapping DSP Algorithms to General-Purpose Out-of-order Processors 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?