A Low-overhead Scheduling Methodology for Fine-grained Acceleration of Signal Processing Systems

JANI BOUTELLIER
Machine Vision Group, University of Oulu, P.O.Box 4500, 90014 Finland
[email protected], Tel. +358 8 553 2814, Fax +358 8 553 2612

SHUVRA S. BHATTACHARYYA
Electrical and Computer Engineering Department, University of Maryland, College Park, MD, USA

OLLI SILVÉN
Machine Vision Group, University of Oulu, P.O.Box 4500, 90014 Finland

Abstract. Fine-grained accelerators have the potential to deliver significant benefits in various platforms for embedded signal processing. Due to the moderate complexity of their targeted operations, these accelerators must be managed with minimal run-time overhead. In this paper, we present a methodology for applying flow-shop scheduling techniques to make effective, low-overhead use of fine-grained DSP accelerators. We formulate the underlying scheduling approach in terms of general flow-shop scheduling concepts, and demonstrate our methodology concretely by applying it to MPEG-4 video decoding. We present quantitative experiments on a soft processor that runs on a field-programmable gate array, and provide insight into trends and trade-offs among different flow-shop scheduling approaches when applied to run-time management of fine-grained acceleration.

Keywords: Scheduling, parallel processing, digital signal processors

1 Introduction

When a data processing system consists of multiple computing units that run in parallel, their mutual communication needs to be scheduled and synchronized in some manner to keep the results consistent. Most applications that use multiple processors can handle their communication with a schedule that is static and computed at compile time. However, in recent years several applications have emerged whose processors are underutilized if pre-computed, fully static schedules are used.
In such applications, the schedules should be computed at run-time, which can lead to high computational overheads if the scheduling algorithm is inefficient.

Journal of Signal Processing Systems, 2009, doi:10.1007/s11265-009-0366-z.

In our context, scheduling is the task of computing a timetable for the processing units in the system, so that all units receive and transmit their data in a timely manner. Scheduling always involves some kind of optimization. The objectives of optimization can vary [1], but most often the purpose is to minimize the makespan. Makespan minimization refers to finding a schedule that completes all assigned tasks in the shortest possible time.

1.1 The benefits of fine-grained acceleration

In this work we assume that the scheduled co-processors are hardwired accelerators. Hardwired accelerators are important in mobile applications, since their energy efficiency is up to 50x higher than that of software implementations of the same algorithm [2]. We further assume that the turnaround times of the accelerators are short compared to those of traditional accelerators [3], i.e. they are fine-grained. Instead of using one monolithic accelerator, the accelerated functionality is implemented on several smaller units. Hardware accelerators can be considered fine-grained when their execution time is around 100-1000 clock cycles, which is a fraction of the runtime of a coarse-grained hardware accelerator.

Recently, Silvén et al. identified that some important modern-day applications, such as video and baseband processing, can benefit from fine-grained hardware accelerators if they are used instead of the traditional coarse-grained ones [3]. This finding has also proven true for the upcoming Reconfigurable Video Coding (RVC) standard [4]. In RVC, existing and future video decoders are implemented by combining a set of standard video coding tools originating from a pre-defined library.
These library components are rather fine-grained, and if they are translated into hardware accelerators to achieve high performance, their execution times become very short. Present-day applications such as MPEG-4 can also benefit from fine-grained hardware accelerators, as we will show later in this work.

Fine-grained hardware acceleration brings up new problems that do not exist with traditional coarse-grained acceleration. Rintaluoma et al. showed that synchronization primitives such as interrupts and polling can create prohibitive overheads [5]. Moreover, all scheduling activities performed at run-time can slow down the system tremendously, since the scheduler needs to be invoked much more frequently than in coarse-grained systems.

In this paper we introduce some very fast scheduling methods and discuss the areas where they can be used. We also show how to apply run-time scheduling of fine-grained hardware accelerators to MPEG-4 video decoding. Finally, the performance of the presented algorithms is measured on a field-programmable gate array (FPGA) and the results are analyzed.

A part of this work has been published previously in [6]. In addition to general improvements, this paper extends the previous work by offering a more theoretical analysis of the results, as well as new experimental results that have been acquired from running the experiments on a soft RISC processor.

1.2 MPEG-4 video decoding with fine-grained acceleration

Fine-grained hardware acceleration is a variant of conventional hardware acceleration, where the accelerators implement only small-scale tasks with latencies of 100-1000 clock cycles. Figure 1 shows an example of a fine-grain accelerated system architecture with shared memory. Fine-grain accelerated functions can be, e.g., the computation of an 8x8 IDCT or interpolation, both of which can be done in less than 100 clock cycles [7, 8].
MPEG-4 video decoding is an important application for which fine-grained hardware accelerators can yield significant benefits [9]. The fine granularity also makes it possible to accelerate general-purpose functions that can be shared by different applications, as is done in RVC [4]. An evident example would be a multi-standard video decoder chip, since different video codecs have plenty of common functions.

Fine-grained hardware acceleration also enables better hardware utilization. For example, in MPEG-4 video the contents of a macroblock can vary greatly, and a monolithic, statically scheduled macroblock decoder chip can end up running half-idle because some macroblocks are only
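
As an illustration of the makespan objective defined in Section 1, the following minimal Python sketch computes the makespan of a permutation flow-shop schedule, in which every job passes through the same sequence of machines. It is only a sketch under stated assumptions: the function names, the two-machine setup and the cycle counts in CYCLES are hypothetical and are not taken from the experiments of this paper. The exhaustive search in best_order is included only to make the objective concrete; it is not the scheduling method proposed here and would be far too slow for run-time use.

```python
from itertools import permutations

def flow_shop_makespan(proc_times, order):
    """Return the makespan of a permutation flow-shop schedule.

    proc_times[j][m] is the processing time (in clock cycles) of job j on
    machine m; every job visits the machines in the order 0, 1, 2, ...
    order is the sequence in which the jobs enter the first machine.
    """
    num_machines = len(proc_times[0])
    # finish[m] holds the completion time of the latest job placed on machine m.
    finish = [0] * num_machines
    for job in order:
        for m in range(num_machines):
            # The job may start on machine m once it has left machine m - 1
            # and machine m has finished the previous job.
            ready = finish[m - 1] if m > 0 else 0
            finish[m] = max(finish[m], ready) + proc_times[job][m]
    return finish[-1]

def best_order(proc_times):
    """Exhaustively search for a job order with minimal makespan (illustration only)."""
    jobs = range(len(proc_times))
    return min(permutations(jobs),
               key=lambda order: flow_shop_makespan(proc_times, order))

# Hypothetical cycle counts: three jobs, each visiting two accelerators in turn.
CYCLES = [
    [300, 100],
    [100, 400],
    [200, 200],
]

arrival_makespan = flow_shop_makespan(CYCLES, (0, 1, 2))
optimal_makespan = flow_shop_makespan(CYCLES, best_order(CYCLES))
print(arrival_makespan, optimal_makespan)
```

With these assumed numbers, processing the jobs in arrival order takes 1000 cycles, whereas the best of the six possible orders takes 800 cycles. The difference shows why the job order matters when makespan is the objective.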

