WM CSCI 424 - Lecture Notes

Contents
- The Big Picture: Where are We Now?
- Applications Needing "Supercomputing"
- Encountering Amdahl's Law
- Examples: Amdahl's Law
- Supercomputer Style Migration (Top500)
- Multiprocessor/Clusters Key Questions
- Flynn's Classification Scheme
- SIMD Processors
- Example SIMD Machines
- Multiprocessor Basic Organizations
- Shared Address (Shared Memory) Multi's
- N/UMA Remote Memory Access Times (RMAT)
- Single Bus (Shared Address UMA) Multi's
- Summing 100,000 Numbers on 100 Processors
- Message Passing Multiprocessors
- Summary

MultiIntro.1  The Big Picture: Where are We Now?

[Figure: two complete computers, each with a processor (control + datapath), memory, and input/output, connected to one another]

- Multiprocessor – multiple processors with a single shared address space
- Cluster – multiple computers (each with its own address space) connected over a local area network (LAN), functioning as a single system

MultiIntro.2  Applications Needing "Supercomputing"

- Energy: plasma physics (simulating fusion reactions), geophysical (petroleum) exploration
- DoE stockpile stewardship: ensuring the safety and reliability of the nation's stockpile of nuclear weapons
- Earth and climate: climate and weather prediction; earthquake and tsunami prediction and mitigation of risks
- Transportation: improving vehicles' airflow dynamics, fuel consumption, crashworthiness, and noise reduction
- Bioinformatics and computational biology: genomics, protein folding, designer drugs
- Societal health and safety: pollution reduction, disaster planning, terrorist action detection

MultiIntro.4  Encountering Amdahl's Law

- Speedup due to an enhancement E is

    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

- Suppose enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected. Then

    ExTime w/ E  = ExTime w/o E x ((1 - F) + F/S)
    Speedup w/ E = 1 / ((1 - F) + F/S)

MultiIntro.6  Examples: Amdahl's Law

- Consider an enhancement that runs 20 times faster but is usable only 25% of the time:

    Speedup w/ E = 1 / (.75 + .25/20) = 1.31

- What if it is usable only 15% of the time?

    Speedup w/ E = 1 / (.85 + .15/20) = 1.17

- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less, by Speedup w/ E = 1 / ((1 - F) + F/S). (The short C sketch below checks these numbers.)
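The arithmetic above is easy to verify mechanically. Here is a minimal C sketch of the Speedup w/ E formula; it is not from the slides (the function name amdahl_speedup is ours), just a quick check of the worked examples:

    #include <stdio.h>

    /* Speedup w/ E = 1 / ((1 - F) + F/S): F is the fraction of the task the
     * enhancement applies to, S is the factor by which that fraction speeds up. */
    static double amdahl_speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        /* The two worked examples from the slide (S = 20) */
        printf("F = 0.25: %.2f\n", amdahl_speedup(0.25, 20.0));   /* 1.31 */
        printf("F = 0.15: %.2f\n", amdahl_speedup(0.15, 20.0));   /* 1.17 */
        /* 100 processors with only 0.01% scalar code: speedup ~99 */
        printf("F = 0.9999: %.2f\n", amdahl_speedup(0.9999, 100.0));
        return 0;
    }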
MultiIntro.7  Supercomputer Style Migration (Top500)

- Over the last eight years of the list, uniprocessors and SIMDs disappeared while clusters and constellations grew from 3% to 80% of the systems

[Chart: number of Top500 systems by class (Clusters, Constellations, SIMDs, MPPs, SMPs, Uniprocessors), November lists, 1993-2005; data from http://www.top500.org/lists/2005/11/]

- Cluster – whole computers interconnected using their I/O bus
- Constellation – a cluster that uses an SMP multiprocessor as the building block

MultiIntro.8  Multiprocessor/Clusters Key Questions

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors can be supported?

MultiIntro.9  Flynn's Classification Scheme

- SISD – single instruction, single data stream
  - aka uniprocessor – what we have been talking about all semester
- SIMD – single instruction, multiple data streams
  - a single control unit broadcasting operations to multiple datapaths
- MISD – multiple instruction, single data stream
  - no such machine (although some people put vector machines in this category)
- MIMD – multiple instructions, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

MultiIntro.10  SIMD Processors

- A single control unit with multiple datapaths (processing elements – PEs) running in parallel
- Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
- Q2 – Each PE performs the same operation on its own local data

[Figure: one control unit driving a 4 x 4 array of PEs]

MultiIntro.11  Example SIMD Machines

  Machine    Maker              Year   # PEs   # b/PE   Max memory (MB)   PE clock (MHz)   System BW (MB/s)
  Illiac IV  UIUC               1972      64       64                 1               13              2,560
  DAP        ICL                1980   4,096        1                 2                5              2,560
  MPP        Goodyear           1982  16,384        1                 2               10             20,480
  CM-2       Thinking Machines  1987  65,536        1               512                7             16,384
  MP-1216    MasPar             1989  16,384        4             1,024               25             23,000

MultiIntro.12  Multiprocessor Basic Organizations

- Processors connected by a single bus
- Processors connected by a network

  Category              Option                 # of Proc
  Communication model   Message passing        8 to 2048
                        Shared address: NUMA   8 to 256
                        Shared address: UMA    2 to 64
  Physical connection   Network                8 to 256
                        Bus                    2 to 36

MultiIntro.13  Shared Address (Shared Memory) Multi's

- UMAs (uniform memory access) – aka SMPs (symmetric multiprocessors)
  - all accesses to main memory take the same amount of time, no matter which processor makes the request or which location is requested
- NUMAs (nonuniform memory access)
  - some main memory accesses are faster than others, depending on the processor making the request and the location requested
  - can scale to larger sizes than UMAs, so are potentially higher performance
- Q1 – A single address space is shared by all the processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores); use of shared data must be coordinated via synchronization primitives (locks)

MultiIntro.14  N/UMA Remote Memory Access Times (RMAT)

  System                 Year   Type   Max Proc   Interconnection Network      RMAT (ns)
  Sun Starfire           1996   SMP          64   Address buses, data switch         500
  Cray T3E               1996   NUMA       2048   2-way 3D torus                     300
  HP V                   1998   SMP          32   8 x 8 crossbar                    1000
  SGI Origin 3000        1999   NUMA        512   Fat tree                           500
  Compaq AlphaServer GS  1999   SMP          32   Switched bus                       400
  Sun V880               2002   SMP           8   Switched bus                       240
  HP Superdome 9000      2003   SMP          64   Switched bus                       275
  NASA Columbia          2004   NUMA      10240   Fat tree                           ???

MultiIntro.15  Single Bus (Shared Address UMA) Multi's

- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization

[Figure: three processors, each with its own cache, sharing a single bus to memory and I/O]

MultiIntro.16  Summing 100,000 Numbers on 100 Processors

- Processors start by running a loop that sums their subset of vector A's numbers (vectors A and sum are shared variables, Pn is the processor's number, i is a private variable):

    sum[Pn] = 0;
    for (i = 1000 * Pn; i < 1000 * (Pn + 1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

- The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors), repeatedly halving the set of processors that are still adding; a runnable threaded version of this reduction is sketched below:

    half = 100;
    repeat
        synch();                              /* synchronize first */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half - 1];  /* when half is odd, P0 picks up
                                                 the leftover element */
        half = half / 2;                      /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn + half];
    until (half == 1);                        /* final sum is in sum[0] */
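To make the two fragments above concrete, here is one possible self-contained version in C with POSIX threads, scaled down so that each of the 100 "processors" is a thread on one machine. This is our sketch, not the slides' code: the names NPROC and worker are invented, and a pthread barrier stands in for synch(). It assumes a POSIX system with pthread barriers (e.g., Linux; compile with -pthread).

    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 100                  /* "processors", mapped to threads */
    #define N     100000               /* 100,000 numbers, 1,000 per processor */

    static double A[N];
    static double sum[NPROC];          /* shared vector of partial sums */
    static pthread_barrier_t barrier;  /* plays the role of synch() */

    static void *worker(void *arg) {
        int Pn = (int)(long)arg;       /* this thread's processor number */
        int i, half;

        /* Phase 1: sum this processor's 1,000-element subset of A */
        sum[Pn] = 0;
        for (i = 1000 * Pn; i < 1000 * (Pn + 1); i = i + 1)
            sum[Pn] = sum[Pn] + A[i];

        /* Phase 2: tree reduction of the partial sums */
        half = NPROC;
        do {
            pthread_barrier_wait(&barrier);       /* synchronize first */
            if (half % 2 != 0 && Pn == 0)         /* odd count: P0 takes the extra */
                sum[0] = sum[0] + sum[half - 1];
            half = half / 2;                      /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] = sum[Pn] + sum[Pn + half];
        } while (half > 1);                       /* final sum lands in sum[0] */
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROC];
        long p;
        int i;

        for (i = 0; i < N; i = i + 1)
            A[i] = 1.0;                           /* known answer: 100000 */
        pthread_barrier_init(&barrier, NULL, NPROC);
        for (p = 0; p < NPROC; p = p + 1)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (p = 0; p < NPROC; p = p + 1)
            pthread_join(t[p], NULL);
        pthread_barrier_destroy(&barrier);
        printf("total = %.0f\n", sum[0]);         /* expect 100000 */
        return 0;
    }

The barrier at the top of each round is what makes the reduction safe: every partial sum written in one round is guaranteed to be visible before any processor reads it in the next round.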

