UW-Madison ME 964 - NVIDIA’s Fermi - The First Complete GPU Computing Architecture

NVIDIA’s Fermi: The First Complete GPU Computing Architecture

A white paper by Peter N. Glaskowsky
Prepared under contract with NVIDIA Corporation
Copyright © September 2009, Peter N. Glaskowsky

Peter N. Glaskowsky is a consulting computer architect, technology analyst, and professional blogger in Silicon Valley. Glaskowsky was the principal system architect of chip startup Montalvo Systems. Earlier, he was Editor in Chief of the award-winning industry newsletter Microprocessor Report. Glaskowsky writes the Speeds and Feeds blog for the CNET Blog Network: http://www.speedsnfeeds.com/

This document is licensed under the Creative Commons Attribution ShareAlike 3.0 License. In short: you are free to share and make derivative works of the file under the conditions that you appropriately attribute it, and that you distribute it only under a license identical to this one.
http://creativecommons.org/licenses/by-sa/3.0/

Company and product names may be trademarks of the respective companies with which they are associated.

Executive Summary

After 38 years of rapid progress, conventional microprocessor technology is beginning to see diminishing returns. The pace of improvement in clock speeds and architectural sophistication is slowing, and while single-threaded performance continues to improve, the focus has shifted to multicore designs. These too are reaching practical limits for personal computing; a quad-core CPU isn’t worth twice the price of a dual-core, and chips with even higher core counts aren’t likely to be a major driver of value in future PCs.

CPUs will never go away, but GPUs are assuming a more prominent role in PC system architecture. GPUs deliver more cost-effective and energy-efficient performance for applications that need it. The rapidly growing popularity of GPUs also makes them a natural choice for high-performance computing (HPC).
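The multicore point can be made concrete with Amdahl’s law. The paper does not cite it, but it is a standard way to reason about diminishing returns from added cores: only the parallel share of a program speeds up, so the serial share caps the gain. The sketch assumes a hypothetical workload in which half the run time parallelizes perfectly and the rest stays serial. `amdahl_speedup` is an illustrative helper, not part of any library.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup over one core when only part of the work scales.

    parallel_fraction is the share of single-core run time that can be
    spread perfectly across `cores`; the remainder stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


# Assumed (hypothetical) workload: 50% of the run time is parallel.
p = 0.5
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(p, n), 2))
```

Under that assumed 50% parallel fraction, moving from two cores to four raises the speedup only from about 1.33 to 1.6. That is a 1.2x gain for twice the cores. Real PC workloads have their own parallel fractions, so the exact numbers will differ.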
Gaming and other consumer applications create a demand for millions of high-end GPUs each year, and these high sales volumes make it possible for companies like NVIDIA to provide the HPC market with fast, affordable GPU computing products. NVIDIA’s next-generation CUDA architecture (code-named Fermi) is the latest and greatest expression of this trend. With many times the performance of any conventional CPU on parallel software, and new features to make it easier for software developers to realize the full potential of the hardware, Fermi-based GPUs will bring supercomputer performance to more users than ever before.

Fermi is the first architecture of any kind to deliver all of the features required for the most demanding HPC applications: unmatched double-precision floating-point performance, IEEE 754-2008 compliance including fused multiply-add operations, ECC protection from the registers to DRAM, a straightforward linear addressing model with caching at all levels, and support for languages including C, C++, FORTRAN, Java, Matlab, and Python. With these features, plus many other performance and usability enhancements, Fermi is the first complete architecture for GPU computing.

CPU Computing—the Great Tradition

The history of the microprocessor over the last 38 years describes the greatest period of sustained technical progress the world has ever seen. Moore’s Law, which describes the rate of this progress, has no equivalent in transportation, agriculture, or mechanical engineering. Think how different the Industrial Revolution would have been 300 years ago if, for example, the strength of structural materials had doubled every 18 months from 1771 to 1809. Never mind steam; the 19th century could have been powered by pea-sized internal-combustion engines compressing hydrogen to produce nuclear fusion.
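The compounding in this thought experiment is easy to underestimate. Taking the paper’s figures at face value, 38 years of doubling every 18 months is 38 × 12 / 18 ≈ 25.3 doublings, which multiplies the starting value by roughly 42 million. This minimal sketch just performs that arithmetic; `growth_factor` is a hypothetical helper name.

```python
def growth_factor(start_year: int, end_year: int, doubling_months: float = 18.0) -> float:
    """Multiplicative growth if a quantity doubles every `doubling_months` months."""
    months = (end_year - start_year) * 12
    doublings = months / doubling_months  # 456 / 18 = 25.33... for 1771 to 1809
    return 2.0 ** doublings


factor = growth_factor(1771, 1809)
print(f"{factor:.3g}")  # prints 4.23e+07
```

The 1771 to 1809 window is the same 38-year span the paper uses for the history of the microprocessor, so the factor is the same one a strict 18-month doubling would imply over that history.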
CPU performance is the product of many related advances:
• Increased transistor density
• Increased transistor performance
• Wider data paths
• Pipelining
• Superscalar execution
• Speculative execution
• Caching
• Chip- and system-level integration

The first thirty years of the microprocessor focused almost exclusively on serial workloads: compilers, managing serial communication links, user-interface code, and so on. More recently, CPUs have evolved to meet the needs of parallel workloads in markets from financial transaction processing to computational fluid dynamics.

CPUs are great things. They’re easy to program, because compilers evolved right along with the hardware they run on. Software developers can ignore most of the complexity in modern CPUs; microarchitecture is almost invisible, and compiler magic hides the rest. Multicore chips have the same software architecture as older multiprocessor systems: a simple coherent memory model and a sea of identical computing engines.

But CPU cores continue to be optimized for single-threaded performance at the expense of parallel execution. This fact is most apparent when one considers that integer and floating-point execution units occupy only a tiny fraction of the die area in a modern CPU. Figure 1 shows the portion of the die area used by ALUs in the Core i7 processor (the chip code-named Bloomfield) based on Intel’s Nehalem microarchitecture.

Figure 1. Intel’s Core i7 processor (the chip code-named Bloomfield, based on the Nehalem microarchitecture) includes four CPU cores with simultaneous multithreading, 8MB of L3 cache, and on-chip DRAM controllers. Made with 45nm process technology, each chip has 731 million transistors and consumes up to 130W of thermal design power. Red outlines highlight the portion of each core occupied by execution units. (Source: Intel Corporation except red highlighting)

With such a small part of the chip devoted to performing direct calculations, it’s no surprise that CPUs are relatively inefficient for high-performance computing applications. Most of the circuitry on a CPU, and therefore most of the heat it generates, is devoted to invisible complexity: those caches, instruction decoders, branch predictors, and other features that are not architecturally visible but which enhance single-threaded performance.

Speculation

At the heart of this focus on single-threaded performance is a concept known as speculation. At a high level, speculation encompasses not only speculative execution (in which instructions begin executing even before it is possible to know their results will be needed), but many other elements of CPU design.

Caches, for example, are fundamentally speculative:

