UCLA COMSCI 239 - Data-Parallel Programming


Presented at the GSPx Multicore Applications Conference, Santa Clara, October 31 to November 2, 2006.

Data-Parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform

Michael D. McCool, RapidMind Inc.
470 Weber St N, Waterloo, Ontario, Canada, N2L 6J2
Phone: +1 519 885-5455  Fax: +1 519 885-1463
[email protected]

Abstract – The Cell BE processor is capable of achieving very high levels of performance via parallel computation. The processors on video accelerators, known as GPUs, are also high-performance parallel processors. The RapidMind Development Platform provides a simple data-parallel model of execution that is easy to understand and learn, is usable from any ISO standard C++ program without any special extensions, maps efficiently onto the capabilities of both the Cell BE processor and GPUs, and can be extended to other multicore processors in the future. The RapidMind platform acts as an embedded programming language inside C++. It is built around a small set of types that can be used to capture and specify arbitrary computations. Arbitrary functions, including control flow, can be specified dynamically. Parallel execution is primarily invoked by applying these functions to arrays, generating new arrays. Access patterns on arrays allow data to be collected and redistributed. Collective operations, such as scatter, gather, and programmable reduction, support other standard parallel communication patterns and complete the programming model.

I. INTRODUCTION

The Cell BE processor is an example of a multicore processor, in which several processing elements are integrated onto a single physical chip. It includes nine cores: a PowerPC core called the PPE and eight vector co-processors called SPEs. The PPE is designed for general-purpose computing, including running an operating system, and includes standard features such as a cache. The SPEs, in contrast, are specialized for high-performance numerical computation. They do not include a cache, but instead use a local memory and require explicit DMA transfers to and from a shared DRAM.

The Cell BE is designed for high-performance parallel computing and includes features to support several different styles of parallel computing: SIMD within a register (SWAR) and deep pipelining support small-scale data parallelism; mailboxes and other synchronization and communication features support task parallelism; and a high-speed interconnect permits multiple Cell BE processors to be connected in order to build larger-scale parallel machines.

In addition to the Cell BE, several other multicore processor architectures are emerging. Standard CPUs from IBM, Intel, and AMD are now available in multicore versions. In short order, even ordinary desktops may have 16 or more processor cores. Video accelerator cards, standard components of most desktop PCs, also contain high-performance processors developed by NVIDIA and ATI, called GPUs. GPUs are designed for real-time graphics and media manipulation but can also be used for arbitrary computation. High-end GPUs can include up to 48 cores. Both CPUs and GPUs also include SWAR parallelism.

Unfortunately, use of a parallel processor does not automatically lead to higher application performance, especially when a larger number of cores is considered. Existing application code may not expose sufficient parallelism to scale well to larger numbers of cores. Some programming effort will often be required to take advantage of parallel processing.
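Before turning to the challenges of parallel programming, it may help to make the data-parallel pattern described in the abstract concrete. The sketch below shows, in plain ISO C++, what it means to apply a function to an array to generate a new array, to gather elements through an index pattern, and to perform a programmable reduction. This is a minimal illustration only, not the RapidMind API: the helper names map_array, gather_by_index, and reduce_with and their signatures are invented here, and a real platform would execute the elementwise work in parallel on the SPEs or GPU rather than in a serial loop.

#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Elementwise application ("map"): produce a new array by applying f to
// every element of the input. Each iteration is independent, so a parallel
// runtime is free to distribute the work across cores.
template <typename T, typename F>
std::vector<T> map_array(F f, const std::vector<T>& in) {
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = f(in[i]);
    return out;
}

// Gather: collect elements of 'data' at the positions given by 'index'.
template <typename T>
std::vector<T> gather_by_index(const std::vector<T>& data,
                               const std::vector<std::size_t>& index) {
    std::vector<T> out(index.size());
    for (std::size_t i = 0; i < index.size(); ++i)
        out[i] = data[index[i]];
    return out;
}

// Programmable reduction: combine all elements with a user-supplied operator.
template <typename T, typename Op>
T reduce_with(const std::vector<T>& data, T init, Op op) {
    return std::accumulate(data.begin(), data.end(), init, op);
}

int main() {
    std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};

    // Map: square each element, generating a new array.
    std::vector<float> sq = map_array([](float x) { return x * x; }, a);

    // Gather: reverse the array through an index pattern.
    std::vector<float> rev = gather_by_index(sq, {3, 2, 1, 0});

    // Reduce: sum the elements (1 + 4 + 9 + 16 = 30).
    float total = reduce_with(rev, 0.0f, std::plus<float>());
    (void)total;
    return 0;
}

The property that the model relies on is visible even in this serial form: each output element of map_array depends only on the corresponding input element, so the iterations are independent and can be distributed across cores without synchronization.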
Parallel programming is different from, and frequently more difficult than, serial programming. Choices of algorithms, software architectures, languages, development tools, and programming techniques need to be reconsidered. These choices affect not only the performance of the resulting applications but also their robustness and portability. Parallel programming brings a new set of debugging challenges, including the possibility of deadlock, timing, and synchronization errors. Also, the hardware design space for parallel multicore processors is large and diverse, which means that code written at too low a level of abstraction will not be portable to another parallel machine. Software tools can substantially reduce the effort needed to parallelize code and can provide a layer of abstraction that enhances portability.

Automatic parallelization of existing code would be an ideal solution. Unfortunately, automatic parallelization is fundamentally limited by the fact that the best parallel algorithm for a problem may be substantially different from the best serial algorithm for the same problem. It is also hampered by the fact that existing programming languages were not designed to express parallel computations. Parallel execution of loops, for instance, can be prohibited by subtle language features such as pointer aliasing (a brief illustration appears below). This means that small and seemingly innocuous changes to a program can inhibit automatic parallelization.

To express parallel algorithms in a maintainable way, while achieving consistent performance improvements, an explicit approach is desirable. At the same time, deploying completely new programming languages to express parallel computation would involve an unacceptably large retooling, conversion, and training cost. An incremental approach that builds on current practice would be more desirable.

Our goal was to design a programming platform to address these challenges. The RapidMind Development Platform addresses multiple issues, from the specification of parallel algorithms via a high-level abstraction, through dynamic code generation for multiple specialized hardware targets, to runtime management of parallel processes. We want to enable developers to quickly build portable, high-performance applications using a simple, safe, and easy-to-use explicitly parallel programming model. In particular, we want to give developers access to the power of both the Cell BE and the GPU, but at a high level, so they can focus on developing efficient parallel algorithms instead of managing low-level, architecture-specific details.

The RapidMind platform has several unique features. First, it has a C++ interface for describing computation, rather than a separate language. The platform interface is implemented using only ISO standard C++ features and works with
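Returning to the aliasing point above, the following minimal C++ sketch (not taken from the paper) shows why such a subtle feature matters. In scale, nothing tells the compiler that dst and src refer to disjoint arrays, so it must assume they may overlap; __restrict, a non-standard but widely supported compiler extension (spelled restrict in C99), asserts non-overlap and restores the independence a parallelizer or vectorizer needs.

#include <cstddef>

// Because dst and src may refer to overlapping memory (e.g. dst == src + 1),
// the compiler must preserve the serial order of these iterations and cannot
// safely vectorize or parallelize the loop.
void scale(float* dst, const float* src, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

// With the (non-standard but widely supported) __restrict qualifier, the
// programmer promises that the two arrays do not overlap, so each iteration
// is independent and the loop becomes safe to vectorize or parallelize.
// Passing overlapping pointers to this version would be undefined behaviour,
// which is exactly the kind of subtle, easily violated assumption that makes
// fully automatic parallelization fragile.
void scale_restrict(float* __restrict dst, const float* __restrict src,
                    std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}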

