Berkeley COMPSCI C267 - Challenges of Future High-End Computing

Challenges of Future High-End Computing
David H. Bailey
NERSC, Lawrence Berkeley Lab
Mail Stop 50B-2239
Berkeley, CA
[email protected]

A Petaflops Computer System
- 1 Pflop/s (10^15 flop/s) in computing power.
- Between 10,000 and 1,000,000 processors.
- Between 10 Tbyte and 1 Pbyte main memory (1 Pbyte = 100 times the capacity of the U.C. Berkeley library).
- Between 1 Pbyte and 100 Pbyte on-line mass storage.
- Between 100 Pbyte and 10 Ebyte archival storage.
- Commensurate I/O bandwidth, etc.
- If built today, it would cost $50 billion and consume 1,000 megawatts of electric power.
- May be feasible and "affordable" by the year 2010 or sooner.

Petaflops Computers: Who Needs Them?
Expert predictions:
- (c. 1950) Thomas J. Watson: only about six computers are needed worldwide.
- (c. 1977) Seymour Cray: there are only about 100 potential customers worldwide for a Cray-1.
- (c. 1980) IBM study: only about 50 Cray-class computers will be sold per year.
Present reality:
- Some private homes now have six Cray-1-class computers!

Applications for Petaflops Systems
- Nuclear weapons stewardship.
- Cryptology and digital signal processing.
- Satellite data processing.
- Climate and environmental modeling.
- Design of advanced aircraft and spacecraft.
- Design of practical fusion energy systems.
- Pattern matching in DNA sequences.
- 3-D protein molecule simulations.
- Global-scale economic modeling.
- Virtual reality design tools for molecular nanotechnology.
Plus numerous novel applications that can only be dimly envisioned now.

SIA Semiconductor Technology Roadmap

Characteristic                    1999   2001   2003   2006   2009
Feature size (micron)             0.18   0.15   0.13   0.10   0.07
DRAM size (Mbit)                  256    1024   1024   4096   16K
RISC processor clock (MHz)        1200   1400   1600   2000   2500
Transistors (millions)            21     39     77     203    521
Cost per transistor (microcents)  1735   1000   580    255    100

Observations:
- Moore's Law of increasing density will continue until at least 2009.
- Clock rates of RISC processors and DRAM memories are not expected to be more than about twice today's rates.
Conclusion: Future high-end systems will feature tens of thousands of processors, with deeply hierarchical memories.

Designs for a Petaflops System
Commodity technology design:
- 100,000 nodes, each of which is a 10 Gflop/s processor.
- Clock rate = 2.5 GHz; each processor can do four flops per clock.
- Multi-stage switched network.
Hybrid technology, multi-threaded (HTMT) design:
- 10,000 nodes, each with one superconducting RSFQ processor.
- Clock rate = 100 GHz; each processor sustains 100 Gflop/s.
- Multi-threaded processor design handles a large number of outstanding memory references.
- Multi-level memory hierarchy (CRAM, SRAM, DRAM, etc.).
- Optical interconnection network.

Little's Law of Queuing Theory
Little's Law: average number of waiting customers = average arrival rate x average wait time per customer.
Proof: Define f(t) = cumulative number of arrived customers and g(t) = cumulative number of departed customers. Assume f(0) = g(0) = 0 and f(T) = g(T) = N. Consider the region between f(t) and g(t). By Fubini's theorem of measure theory, one can evaluate this area by integrating along either axis: along the time axis the area is Q x T, while along the customer axis it is D x N. Thus Q x T = D x N, where Q is the average length of the queue and D is the average delay per customer. In other words, Q = (N/T) x D.

[Figure: "Little's Law" -- cumulative arrivals and departures plotted against elapsed time.]

Little's Law of High Performance Computing
Assume:
- A single processor-memory system.
- The computation deals with data in local main memory.
- The pipeline between main memory and processor is fully utilized.
Then by Little's Law, the number of words in transit between CPU and memory (i.e., the length of the vector pipe, the size of cache lines, etc.) = memory latency x bandwidth.
This observation generalizes to multiprocessor systems:
    concurrency = latency x bandwidth,
where "concurrency" is the aggregate system concurrency and "bandwidth" is the aggregate system memory bandwidth.
This form of Little's Law was first noted by Burton Smith of Tera.

Little's Law and Petaflops Computing
Assume:
- DRAM memory latency = 100 ns.
- A 1:1 ratio between memory bandwidth (word/s) and sustained performance (flop/s).
- The cache and/or processor system can maintain enough outstanding memory references to cover the latency.
Commodity design: clock rate = 2.5 GHz, so the latency is 250 clock periods. Then system concurrency = 100,000 x 4 x 250 = 10^8.
HTMT design: clock rate = 100 GHz, so the latency is 10,000 clock periods. Then system concurrency = 10,000 x 10,000 = 10^8.
But by Little's Law, system concurrency = 10^-7 x 10^15 = 10^8 in each case.

Amdahl's Law and Petaflops Computing
Assume:
- A commodity petaflops system -- 100,000 CPUs, each of which can sustain 10 Gflop/s.
- 90% of operations can fully utilize 100,000 CPUs.
- 10% can only utilize 1,000 or fewer processors.
Then by Amdahl's Law, sustained performance < 10^10 / [0.9/10^5 + 0.1/10^3] = 9.2 x 10^13 flop/s (10^10 flop/s being the per-CPU rate), which is less than a tenth of the system's presumed achievable performance.
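To make the Little's Law and Amdahl's Law arithmetic above concrete, here is a minimal Python sketch; the parameter values are taken from the slides, while the function and variable names are my own.

```python
# Sketch of the concurrency and Amdahl arithmetic above.
# Parameter values are taken from the slides; names are my own.

def littles_law_concurrency(latency_s, bandwidth_words_per_s):
    """Little's Law for memory systems: concurrency = latency x bandwidth."""
    return latency_s * bandwidth_words_per_s

def amdahl_sustained(rate_per_cpu, fractions_and_cpus):
    """Sustained flop/s when each fraction of the work can only use a
    limited number of CPUs (Amdahl's Law)."""
    time_per_flop = sum(f / (n * rate_per_cpu) for f, n in fractions_and_cpus)
    return 1.0 / time_per_flop

dram_latency = 100e-9   # 100 ns DRAM latency
peak = 1e15             # 1 Pflop/s, with one word/s of bandwidth per flop/s

# Required concurrency, identical for both designs: 10^-7 x 10^15 = 10^8.
print(f"Required concurrency: {littles_law_concurrency(dram_latency, peak):.0e}")

# Commodity design: 90% of the work uses 100,000 CPUs, 10% only 1,000.
sustained = amdahl_sustained(10e9, [(0.9, 100_000), (0.1, 1_000)])
print(f"Amdahl-limited rate:  {sustained:.1e} flop/s "
      f"({100 * sustained / peak:.0f}% of peak)")
```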
Concurrency and Petaflops Computing
Conclusion: No matter what type of processor technology is used, applications on petaflops computer systems must exhibit roughly 100-million-way concurrency at virtually every step of the computation, or else performance will be disappointing.
- This assumes that most computations access data from local DRAM memory, with little or no cache re-use (typical of many applications).
- If substantial long-distance communication is required, the concurrency requirement may be even higher!
Key question: Can applications for future systems be structured to exhibit these enormous levels of concurrency?

Latency and Data Locality

System                        Latency    Clocks
SGI O2, local DRAM            320 ns     62
SGI Origin, remote DRAM       1 us       200
IBM SP2, remote node          40 us      3,000
HTMT system, local DRAM       50 ns      5,000
HTMT system, remote memory    200 ns     20,000
SGI cluster, remote memory    3 ms       300,000

Algorithms and Data Locality
- Can we quantify the inherent data locality of key algorithms?
- Do there exist "hierarchical" variants of key algorithms?
- Do there exist "latency tolerant" variants of key algorithms?
- Can bandwidth-intensive algorithms be substituted for latency-sensitive algorithms?
- Can Little's Law be "beaten" by formulating algorithms that access data lower in the memory hierarchy? If so, then systems such as HTMT can be used effectively.

A Hierarchical, Latency Tolerant Algorithm for Large 1-D FFTs
- Regard the input data of length n = p q as a p x q complex matrix, distributed so that each ...
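Regarding a length n = p q sequence as a p x q matrix is the starting point of the classic transpose-based "four-step" FFT. As a hedged sketch of that general technique (not necessarily the exact hierarchical variant described on this slide), here is a minimal NumPy version; the function name and the self-check are my own:

```python
import numpy as np

def four_step_fft(x, p, q):
    """Four-step FFT of a length n = p*q vector: two sets of small FFTs
    plus a twiddle multiplication and a transpose-style data rearrangement.
    A sketch of the general technique, not the slide's exact algorithm."""
    n = p * q
    assert len(x) == n
    # Step 1: view x as a p x q matrix with A[t1, t2] = x[t1 + p*t2].
    A = x.reshape(q, p).T
    # Step 2: length-q FFTs along each row.
    B = np.fft.fft(A, axis=1)
    # Step 3: multiply by the twiddle factors exp(-2*pi*i*t1*k2 / n).
    t1 = np.arange(p).reshape(p, 1)
    k2 = np.arange(q).reshape(1, q)
    C = B * np.exp(-2j * np.pi * t1 * k2 / n)
    # Step 4: length-p FFTs along each column; reading the result in
    # row-major order gives the full length-n transform.
    D = np.fft.fft(C, axis=0)
    return D.reshape(n)

# Quick check against a direct FFT.
rng = np.random.default_rng(0)
x = rng.standard_normal(12) + 1j * rng.standard_normal(12)
assert np.allclose(four_step_fft(x, 3, 4), np.fft.fft(x))
```

The appeal of this factorization for deep memory hierarchies is that the long transform becomes many short FFTs over locally held rows and columns, with the long-distance data motion concentrated in the transpose-style rearrangement and the twiddle multiplication.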

