Stanford EE 392C - Study Notes

Scientific Applications
Jing Jiang, Jayanth Gummaraju, Rohit Gupta
4/19/2003

Slide 2: Outline
- Application study
- Vortex
- Architectural issues
- Benchmarks

Slide 3: Applications
(Figure taken from the Cray Research website: www.cray.com)

Slide 4: Interesting Example: Weather Forecasting
This slide was taken from a course on parallel programming in Finland. It is interesting because it shows that predicting the weather for the next day on a 100 MFLOPS machine would take about 16 days. On the fastest supercomputer today, NEC's Earth Simulator, with a peak performance of 40 TFLOPS, it would take about 10 hours.

Slide 5: Example Application: Vortex N-body Simulation
- The application models the evolution of vortices in a 2-dimensional fluid.
- O(N^2) interactions: every vortex affects every other vortex, so each stage computes O(N^2) interactions among the N vortices.
- Each processor holds N/P bodies: the computational load is evenly distributed among P processors. At each stage of the simulation the results are kept in two copies; one copy stays put while the other is passed to the other processors, shifted along a ring formed by all the processors. Whenever a new block of vortices arrives at a processor, its interactions with the stationary vortices on that processor are computed.
- Binary tree reduction: after the stage computations are done, the processors communicate in a binary tree, with the root reporting the progress of the algorithm to the host computer.
- In this example: 4096 bodies, 100 stages.
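The ring-shift scheme just described can be illustrated with a short sketch. This is not the benchmark's actual code: it emulates the P processors with array slices inside one process, uses the standard 2-D point-vortex (Biot-Savart) velocity kernel as a stand-in for whatever interaction the real application evaluates, and all names and parameters (induced_velocity, ring_allpairs_step, tree_reduce, P = 8, dt) are illustrative assumptions.

import numpy as np

def induced_velocity(targets, sources, gamma):
    """Velocity induced on `targets` by point vortices at `sources` with
    circulations `gamma`, using the 2-D Biot-Savart kernel. Zero-distance
    pairs (a body acting on itself) contribute nothing."""
    dx = targets[:, None, 0] - sources[None, :, 0]          # (nt, ns)
    dy = targets[:, None, 1] - sources[None, :, 1]
    r2 = dx * dx + dy * dy
    r2_safe = np.where(r2 > 0.0, r2, 1.0)
    coef = np.where(r2 > 0.0, gamma[None, :] / (2.0 * np.pi * r2_safe), 0.0)
    return np.stack([np.sum(-coef * dy, axis=1),            # u component
                     np.sum(coef * dx, axis=1)], axis=1)    # v component

def ring_allpairs_step(pos, gamma, P):
    """One simulation stage of the ring scheme: each 'processor' owns N/P
    bodies; a travelling copy of every block is shifted around the ring so
    that each block eventually meets every other block (O(N^2) work)."""
    N = len(pos)
    n = N // P                                   # assume P divides N
    blocks = [slice(p * n, (p + 1) * n) for p in range(P)]
    trav_pos = [pos[b].copy() for b in blocks]   # travelling copies
    trav_gam = [gamma[b].copy() for b in blocks]
    vel = np.zeros_like(pos)
    for _ in range(P):                           # P hops visit every block
        for p in range(P):
            vel[blocks[p]] += induced_velocity(pos[blocks[p]],
                                               trav_pos[p], trav_gam[p])
        trav_pos = trav_pos[1:] + trav_pos[:1]   # shift one hop along the ring
        trav_gam = trav_gam[1:] + trav_gam[:1]
    return vel

def tree_reduce(partials):
    """Binary-tree combination of per-processor partial results, as the
    slides describe for reporting progress to the host."""
    vals = list(partials)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, P, dt = 4096, 8, 1e-3                     # slides: 4096 bodies, 100 stages
    pos = rng.standard_normal((N, 2))
    gamma = rng.standard_normal(N)
    for stage in range(3):                       # a few stages for illustration
        vel = ring_allpairs_step(pos, gamma, P)
        pos = pos + dt * vel                     # simple Euler update
        partials = [float(np.sum(vel[p * (N // P):(p + 1) * (N // P)] ** 2))
                    for p in range(P)]
        print("stage", stage, "velocity-norm metric:", tree_reduce(partials))

In a real message-passing implementation each travelling block would be sent to the next processor on the ring (for example with MPI point-to-point sends) rather than rotated in a Python list, and the tree_reduce step would be a reduction across processors.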
Slide 6: Code Size
The accompanying table lists a wide range of scientific applications running on three different machines. The following slides explore their behavior in terms of floating-point operations, memory, I/O, and communication. The applications are:
- CLIMATE: global climate simulation
- SEMI: 3-D semiconductor device simulation
- MOLECULE: molecular dynamics simulation
- RENDER: 3-D perspective rendering
- EXFLOW: 3-D flow using adaptive grids
- VORTEX: 2-D fluid dynamics
- QCD: quantum chromodynamics
- REACT: quantum chemical reaction dynamics

Slide 7: Memory Requirements
Total memory required for data varies from 193 KB to 1.75 GB, averaging 665 MB. Memory required per processor ranges from 900 KB to 4 MB. Programs like VORTEX, which fetch their data at the beginning of the run and work on the same dataset for most of the execution, need much less memory than programs that are constantly fed new data.

Slide 8: Processing Requirements
The single most notable common characteristic of the applications studied is that they are all floating-point intensive, so investment in improving floating-point performance would seem justifiable. However, everything else may need to scale as well; that issue is covered in a later slide.

Slide 9: I/O Requirements
Again the numbers vary. Two categories:
- CLIMATE, SEMI, RENDER, and REACT perform I/O regularly throughout the run.
- MOLECULE, EXFLOW, QCD, and VORTEX perform I/O only at the beginning and end of the run.

Slide 10: Communication Requirements
Most of these applications were developed with the communication capabilities of the current generation of parallel machines in mind, so it seems likely that future applications will have even larger communication loads than those presented here. In addition, communication demands tend to come in bursts, further increasing the need for high bandwidth in the communication network.

Slide 11: General Characteristics
- Number-crunching applications: typically a large number of arithmetic (floating-point) operations.
- Typically large data and working sets. This depends on the application, however; Vortex, for example, has a small data set.
- Typically low temporal locality: data, once used, is usually not reused in the near future. In Vortex, for example, the vortex positions are loaded, the forces between them are computed, and the positions are updated; this is repeated every stage, so positions and forces are loaded at most once per stage and there is hardly any temporal locality.
- Depending on the regularity of the application, spatial locality can be high.

Slide 12: Parallelism
- Lots of DLP, TLP, and ILP.
- DLP: the same operation is performed on all bodies.
- TLP: DLP can be converted to TLP, which offers more flexibility than DLP.
- ILP: parallelism within "threads".
- Example: Vortex is mostly DLP, with the DLP converted to TLP.

Slide 13: Architectural Issues (section divider)

Slide 14: Performance Trends
Scaling faster than Moore's law! The accompanying graph shows how supercomputers have scaled, and are projected to scale, over a period of 20 years. Starting from the Cray Y-MP in 1990, the expectation of supercomputer performance has grown faster than conventional desktops, which scale with Moore's law. This is because not only have the processors themselves scaled, but so have the number of processors in a system and the number of processing units on each chip. (Graph taken from the Cray Research website: www.cray.com)

Slide 15: Processing Requirements: Petaflops?
One standard measure of performance is the peak, or theoretical upper bound on, the number of floating-point operations that can be executed per second. Several applications are now expected to need on the order of 10^15 FLOPS, i.e. petaflops, over the next few years. Currently the fastest system is the NEC Earth Simulator, at approximately 40 TFLOPS peak. It is important to note that peak performance is not an accurate measure; sustained performance, which can be drastically lower because of other bottlenecks in the system, gives the true measure. Some systems currently run at as low as 10-15% efficiency. Two approaches to achieving this computational capacity:
- Cluster systems: typically hundreds to thousands of processors.
- Stream/vector systems: fewer, custom-designed, highly powerful processors, e.g. the Stanford Streaming Supercomputer.

Slide 16: Interconnection Networks
Both bandwidth and latency are important. In many systems there is a potential to trade off bandwidth for latency, or vice versa; that is not the case here, since the data sets involved are often very large and so is the number …
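The point on the last slide, that these applications cannot simply trade bandwidth for latency because their messages are large, can be made concrete with a first-order cost model, T ≈ latency + message_size / bandwidth, applied to the ring shift from the Vortex example. This is a back-of-the-envelope sketch; the per-hop latency, link bandwidth, and bytes-per-body figures below are assumptions chosen for illustration, not numbers from the study.

def ring_shift_time(n_bodies, n_procs, bytes_per_body=24,
                    latency_s=5e-6, bandwidth_bytes_s=500e6):
    """First-order time for one full ring pass in the Vortex-style scheme:
    each of the P-1 hops moves one block of N/P bodies, and each hop costs
    latency + message_size / bandwidth (bandwidth in bytes per second)."""
    msg_bytes = (n_bodies // n_procs) * bytes_per_body
    per_hop = latency_s + msg_bytes / bandwidth_bytes_s
    return (n_procs - 1) * per_hop

# Assuming 24 bytes per body (x, y, circulation as 8-byte doubles),
# 5 microsecond per-hop latency, 500 MB/s links, and the slides'
# 4096 bodies on 8 processors:
print(ring_shift_time(4096, 8))   # ~2.1e-4 s per stage; the ~12 KB message
                                  # makes the bandwidth term dominate latency

With messages this large the bytes/bandwidth term dominates, which is the slide's point: for these data-set sizes, lowering latency alone does not compensate for limited bandwidth.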

