ME964: High Performance Computing for Engineering Applications
Overview of Parallel Computing Hardware and Execution Models
January 27, 2011
© Dan Negrut, 2011, ME964 UW-Madison

"The Internet is a great way to get on the net." - Senator Bob Dole

Before We Get Started…
Last time:
- Wrapped up the overview of C programming
- Started the overview of parallel computing
- Focused primarily on the limitations of the sequential computing model
- These limitations, together with Moore's law, usher in the age of parallel computing
Today:
- Discuss parallel computing models, hardware, and software
- Start the discussion of GPU programming and CUDA
Thank you to those of you who took the time to register for auditing.

The CPU Speed - Memory Latency Gap
[Figure courtesy of Elsevier: Computer Architecture, Hennessy and Patterson, fourth edition.]
The memory baseline is 64 KB DRAM in 1980, with a 1.07x/year improvement in latency. CPU speed improved at 1.25x/year until 1986, at 1.52x/year until 2004, and at 1.20x/year thereafter.

Vision of the Future: "Parallelism for Everyone"
- Parallelism changes the game: a large percentage of people who provide applications are going to have to care about parallelism in order to match the capabilities of their competitors.
[Figure: performance vs. time; the GHz era gives way to the multi-core era, with a growing gap. Competitive pressures = demand for parallel applications.]
From a presentation by Paul Petersen, Sr. Principal Engineer, Intel. (ISV: Independent Software Vendor)

Intel Larrabee and Knights Ferry
Paul Otellini, President and CEO, Intel:
- "We are dedicating all of our future product development to multicore designs."
- "We believe this is a key inflection point for the industry."
Larrabee is a thing of the past now. Knights Ferry and Intel's MIC (Many Integrated Core) architecture carry the effort forward, with 32 cores for now. Public announcement: May 31, 2010.

Putting Things in Perspective…
(Slide source: Berkeley View of the Landscape)
The way business has been run in the past → It will probably change to this…
- Increasing clock frequency is the primary method of performance improvement → Processor parallelism is the primary method of performance improvement.
- Don't bother parallelizing an application; just wait and run it on a much faster sequential computer → Nobody is building a one-processor-per-chip part anymore. This marks the end of the La-Z-Boy programming era.
- Less-than-linear scaling for a multiprocessor is failure → Given the switch to parallel hardware, even sub-linear speedups are beneficial as long as you beat the sequential version.

End: Discussion of Computational Models and Trends
Beginning: Overview of HW & SW for Parallel Computing

Amdahl's Law
"A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
Excerpt from "Validity of the single processor approach to achieving large scale computing capabilities," by Gene M. Amdahl, in Proceedings of the AFIPS Spring Joint Computer Conference, pp. 483-485, 1967.

Amdahl's Law [Cntd.]
- Sometimes called the law of diminishing returns.
- In the context of parallel computing, it is used to illustrate how going parallel with part of your code leads to an overall speedup.
- The art is to find, for the same problem, an algorithm with a large r_p (the fraction of the work that can be parallelized). This sometimes requires a completely different angle of approach to the solution.
- Nomenclature: algorithms for which r_p = 1 are called "embarrassingly parallel."

Example: Amdahl's Law
- Suppose a program spends 60% of its time in I/O operations and pre- and post-processing.
- The remaining 40% is spent on computation, most of which can be parallelized.
- Assume you buy a multicore chip and can throw 6 parallel threads at this problem. What is the maximum speedup you can expect from this investment?
- Asymptotically, what is the maximum speedup you can ever hope for?
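For reference, since the slides use r_p without writing the law out: with r_p the fraction of the work that parallelizes and p processors, and assuming the parallel portion scales perfectly, Amdahl's Law reads

    S(p) = \frac{1}{(1 - r_p) + r_p/p},
    \qquad
    \lim_{p \to \infty} S(p) = \frac{1}{1 - r_p}

The limit is the punchline: no processor count can push the speedup past 1/(1 - r_p).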
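A worked answer to the example above, under the simplifying assumption that all of the 40% compute portion parallelizes perfectly (the slide says only "most" of it does, so treat these as upper bounds): plug r_p = 0.4 and p = 6 into the formula.

    S(6) = \frac{1}{0.6 + 0.4/6} = \frac{1}{2/3} = 1.5,
    \qquad
    S(\infty) = \frac{1}{0.6} \approx 1.67

Six threads buy at most a 1.5x speedup, and no amount of hardware gets past roughly 1.67x while 60% of the run time remains sequential.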
A Word on "Scaling"
- Algorithmic scaling of a solution algorithm:
  - At this point you only have a mathematical solution algorithm.
  - Refers to how the effort required by the solution algorithm scales with the size of the problem.
  - Examples: a naive implementation of the N-body problem scales like O(N^2), where N is the number of bodies (see the sketch at the end of these notes); sophisticated algorithms scale like O(N log N); Gaussian elimination scales like the cube of the number of unknowns in your linear system.
- Scaling of an implementation on a certain architecture:
  - Intrinsic scaling: how the wall-clock run time increases with the size of the problem.
  - Strong scaling: how the wall-clock run time of an implementation changes when you increase the processing resources.
  - Weak scaling: how the wall-clock run time changes when you increase the problem size and the processing resources together, keeping the ratio of work per processor roughly constant.
  - Order of relevance: strong, intrinsic, weak.
- A thing you should worry about: is the intrinsic scaling similar to the algorithmic scaling? If the intrinsic scaling is significantly worse than the algorithmic scaling, you might have an algorithm that thrashes the memory badly, or you might have a sloppy implementation of the algorithm.

Overview of Large Multiprocessor Hardware Configurations
[Figure courtesy of Elsevier: Computer Architecture, Hennessy and Patterson, fourth edition.]

Newton: 24-GPU Cluster ~ Hardware Configuration ~
[Diagram: a head node and six compute nodes connected by a fast InfiniBand switch and a gigabit Ethernet switch, plus network-attached storage and lab computers; remote users reach the cluster over the Internet through an Ethernet router. Compute node architecture: two Intel Xeon 5520 CPUs, 48 GB of DDR3 RAM, a 1 TB hard disk, a QDR InfiniBand card, and four Tesla C1060 GPUs (4 GB of RAM and 240 cores each) attached via PCIe x16 2.0.]

Some Nomenclature
- Shared address space: when you invoke address "0x0043fc6f" on one machine and then invoke "0x0043fc6f" on a different machine, they actually point to the same global memory space.
  - Issue: memory coherence. Fix: software-based or hardware-based.
- Distributed address space: the opposite of the above.
- Symmetric Multiprocessor (SMP): one machine that shares a certain amount of memory (the same address space) among all of its processing units.
  - Mechanisms should be in place to prevent data hazards (RAW, WAR, WAW); this goes back to memory coherence.
- Distributed shared …
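To make the address-space nomenclature concrete, here is a minimal hybrid sketch (not from the slides; the build line assumes a toolchain with both OpenMP and MPI available, e.g. mpicc). Within one SMP node, the OpenMP threads communicate implicitly through the shared array; across nodes nothing is shared, so the partial sums travel in an explicit MPI message.

    /* sketch.c -- illustrative only.
     * Build: mpicc -fopenmp sketch.c    Run: mpirun -np 4 ./a.out
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Distributed address space: each rank allocates its own slice;
           the same virtual address on another rank is unrelated memory. */
        int n_local = 1000000;
        double *a = malloc(n_local * sizeof *a);
        for (int i = 0; i < n_local; i++) a[i] = 1.0;

        /* Shared address space within the rank: OpenMP threads on one SMP
           node all read the same array `a`; the reduction clause prevents
           a data hazard (RAW/WAW) on `local`. */
        double local = 0.0;
        #pragma omp parallel for reduction(+ : local)
        for (int i = 0; i < n_local; i++) local += a[i];

        /* Across ranks there is no shared memory, so the partial sums are
           combined through an explicit message exchange. */
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("global sum over %d ranks = %.0f\n", nranks, global);
        free(a);
        MPI_Finalize();
        return 0;
    }

The contrast is the point: within a node, coherence is the hardware's problem; across nodes, all communication is explicit.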
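Finally, tying back to the algorithmic-scaling slide above: the sketch below (an illustrative choice, not code from the course) shows the naive N-body force accumulation whose doubly nested loop is the source of the O(N^2) cost. The softening term eps and the unit gravitational constant are assumptions made for the example.

    /* Naive N-body force accumulation: the nested loop visits all
     * N*(N-1) pairs, which is the O(N^2) algorithmic scaling cited
     * above.  Link with -lm for sqrt(). */
    #include <math.h>

    typedef struct { double x, y, z; } Vec3;

    void accumulate_forces(int n, const Vec3 *pos, const double *mass,
                           Vec3 *force) {
        const double eps = 1e-9;   /* softening: avoids division by zero */
        for (int i = 0; i < n; i++) {
            force[i] = (Vec3){0.0, 0.0, 0.0};
            for (int j = 0; j < n; j++) {   /* O(N) work per body... */
                if (j == i) continue;
                double dx = pos[j].x - pos[i].x;
                double dy = pos[j].y - pos[i].y;
                double dz = pos[j].z - pos[i].z;
                double r2 = dx*dx + dy*dy + dz*dz + eps;
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                double s = mass[i] * mass[j] * inv_r3;  /* G taken as 1 */
                force[i].x += s * dx;       /* ...for N bodies: O(N^2) */
                force[i].y += s * dy;
                force[i].z += s * dz;
            }
        }
    }

Tree codes in the Barnes-Hut family reach the O(N log N) scaling the slide mentions by approximating the pull of distant groups of bodies with a single aggregate term.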

