UW-Madison ME 964 - Chapter 1 Introduction

Contents:
Chapter 1 - Introduction
Chapter 2 - CUDA Programming Model
Chapter 3 - CUDA Threading Model
Chapter 4 - CUDA Memory Model

© David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008

Chapter 1 Introduction

Microprocessors based on a single central processing unit (CPU), such as those in the Intel Pentium family and the AMD Opteron family, drove rapid performance increases and cost reductions in computer applications for more than two decades. These microprocessors brought GFLOPS to the desktop and hundreds of GFLOPS to cluster servers. This relentless drive of performance improvement has allowed application software to provide more functionality, better user interfaces, and more useful results. The users, in turn, demand even more improvements once they become accustomed to them, creating a positive cycle for the computer industry.

During this drive, most software developers relied on advances in hardware to increase the speed of their applications under the hood; the same software simply runs faster as each new generation of processors is introduced. This drive, however, has slowed since 2003 due to power consumption issues that limit the increase of the clock frequency and the level of productive activity that can be performed in each clock period within a single CPU. Since then, virtually all microprocessor vendors have switched to multi-core and many-core designs, in which multiple processing units, referred to as processor cores, are used in each chip to increase processing power. This switch has exerted a tremendous impact on the software developer community.

Traditionally, the vast majority of software applications have been written as sequential programs, in the style described by von Neumann in his seminal 1945 report. The execution of these sequential programs can be understood by stepping through the code sequentially.
Historically, computer users have become accustomed to the expectation that these programs run faster with each new generation of microprocessors. That expectation is no longer valid. A sequential program will run on only one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, reducing the growth opportunities of the entire computer industry.

Rather, the application software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to achieve the functionality faster. This new, dramatically escalated incentive for parallel program development has been referred to as the parallelism revolution [Larus ACM Queue article]. The practice of parallel programming is by no means new: the high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers, and only a few elite applications can justify their use, which has limited the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.

1.1. GPUs as Parallel Computers

Since 2003, a class of many-core processors called Graphics Processing Units (GPUs) has led the race for floating-point performance. This phenomenon is illustrated in Figure 1.1.
While the performance improvement of general-purpose microprocessors has slowed significantly, GPUs have continued to improve relentlessly. As of 2008, the ratio of peak floating-point calculation throughput between many-core GPUs and multi-core CPUs is about 10: 367 GFLOPS versus 32 GFLOPS. These are not necessarily achievable application speeds, merely the raw rates that the execution resources in these chips can potentially support. NVIDIA subsequently delivered software driver and clock improvements that allow a G80 Ultra to reach 518 GFLOPS, only about seven months after the original chip. In June 2008, NVIDIA introduced the GT200 chip, which delivers almost 1 teraflop (1,000 GFLOPS) of single-precision and almost 100 GFLOPS of double-precision performance.

[Figure 1.1. Enlarging Performance Gap between GPUs and CPUs. Chart of peak GFLOPS from Jan 2003 to Jul 2007 for NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, GeForce 8800 GTX, Quadro FX 5600, Tesla C870) versus Intel CPUs (3.0 GHz Pentium 4, 3.0 GHz Core 2 Duo, 3.0 GHz Core 2 Quad). Based on slide 7 of S. Green, "GPU Physics," SIGGRAPH 2007 GPGPU Course, http://www.gpgpu.org/s2007/slides/15-GPGPU-physics.pdf]

Such a large performance gap between parallel and sequential processors amounts to a significant "electrical potential build-up," and at some point something will have to give. We may be nearing that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming: when there is more work to do, there is more opportunity to divide the work among cooperating threads of execution. One might ask why there is such a large performance gap between many-core GPUs and general-purpose multi-core CPUs.
The answer lies in the differences in the fundamental design philosophies of the two types of processors, as illustrated in Figure 1.2. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel, or even out of their sequential order.

