WM CSCI 424 - Lecture Notes - D1581713

Outline
    The Big Picture: Where are We Now?
    Applications Needing “Supercomputing”
    Slide 4
    Slide 6
    Supercomputer Style Migration (Top500)
    Multiprocessor/Clusters Key Questions
    Flynn’s Classification Scheme
    SIMD Processors
    Example SIMD Machines
    Multiprocessor Basic Organizations
    Shared Address (Shared Memory) Multi’s
    N/UMA Remote Memory Access Times (RMAT)
    Single Bus (Shared Address UMA) Multi’s
    Summing 100,000 Numbers on 100 Processors
    Slide 18
    Message Passing Multiprocessors
    Summary

MultiIntro.1 The Big Picture: Where are We Now?
[Figure: two computers side by side, each with a Processor (Control, Datapath), Memory, Input, and Output]
Multiprocessor – multiple processors with a single shared address space
Cluster – multiple computers (each with their own address space) connected over a local area network (LAN) functioning as a single system

MultiIntro.2 Applications Needing “Supercomputing”
Energy (plasma physics (simulating fusion reactions), geophysical (petroleum) exploration)
DoE stockpile stewardship (to ensure the safety and reliability of the nation’s stockpile of nuclear weapons)
Earth and climate (climate and weather prediction, earthquake, tsunami prediction and mitigation of risks)
Transportation (improving vehicles’ airflow dynamics, fuel consumption, crashworthiness, noise reduction)
Bioinformatics and computational biology (genomics, protein folding, designer drugs)
Societal health and safety (pollution reduction, disaster planning, terrorist action detection)

MultiIntro.4 Encountering Amdahl’s Law
Speedup due to enhancement E is
    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected
    ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
    Speedup w/ E = 1 / ((1-F) + F/S)

MultiIntro.6 Examples: Amdahl’s Law
Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.
    Speedup w/ E = 1/(.75 + .25/20) = 1.31
What if it's usable only 15% of the time?
    Speedup w/ E = 1/(.85 + .15/20) = 1.17
Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less (set S = 100 and solve 99 = 1 / ((1-F) + F/100) for F, which gives 1-F ≈ 0.0001)
    Speedup w/ E = 1 / ((1-F) + F/S)

MultiIntro.7 Supercomputer Style Migration (Top500)
In the last 8 years uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%
[Figure: number of Top500 systems (0 to 500) by style (Clusters, Constellations, SIMDs, MPPs, SMPs, Uniproc's), 1993 to 2005, Nov data. Source: http://www.top500.org/lists/2005/11/]
Cluster – whole computers interconnected using their I/O bus
Constellation – a cluster that uses an SMP multiprocessor as the building block

MultiIntro.8 Multiprocessor/Clusters Key Questions
Q1 – How do they share data?
Q2 – How do they coordinate?
Q3 – How scalable is the architecture? How many processors can be supported?

MultiIntro.9 Flynn’s Classification Scheme
SISD – single instruction, single data stream
    aka uniprocessor - what we have been talking about all semester
SIMD – single instruction, multiple data streams
    single control unit broadcasting operations to multiple datapaths
MISD – multiple instruction, single data
    no such machine (although some people put vector machines in this category)
MIMD – multiple instructions, multiple data streams
    aka multiprocessors (SMPs, MPPs, clusters, NOWs)

MultiIntro.10 SIMD Processors
Single control unit
Multiple datapaths (processing elements – PEs) running in parallel
    Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
    Q2 – Each PE performs the same operation on its own local data
[Figure: a 4 x 4 grid of PEs driven by a single Control unit]

MultiIntro.11 Example SIMD Machines
Machine    Maker              Year  # PEs   Bits/PE  Max memory (MB)  PE clock (MHz)  System BW (MB/s)
Illiac IV  UIUC               1972  64      64       1                13              2,560
DAP        ICL                1980  4,096   1        2                5               2,560
MPP        Goodyear           1982  16,384  1        2                10              20,480
CM-2       Thinking Machines  1987  65,536  1        512              7               16,384
MP-1216    MasPar             1989  16,384  4        1024             25              23,000

MultiIntro.12 Multiprocessor Basic Organizations
Processors connected by a single bus
Processors connected by a network
Category              Option                  # of Proc
Communication model   Message passing         8 to 2048
                      Shared address, NUMA    8 to 256
                      Shared address, UMA     2 to 64
Physical connection   Network                 8 to 256
                      Bus                     2 to 36

MultiIntro.13 Shared Address (Shared Memory) Multi’s
UMAs (uniform memory access) – aka SMP (symmetric multiprocessors)
    all accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
NUMAs (nonuniform memory access)
    some main memory accesses are faster than others depending on the processor making the request and which location is requested
    can scale to larger sizes than UMAs so are potentially higher performance
Q1 – Single address space shared by all the processors
Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
    Use of shared data must be coordinated via synchronization primitives (locks)

MultiIntro.14 N/UMA Remote Memory Access Times (RMAT)
Machine                Year  Type  Max Proc  Interconnection Network      RMAT (ns)
Sun Starfire           1996  SMP   64        Address buses, data switch   500
Cray T3E               1996  NUMA  2048      2-way 3D torus               300
HP V                   1998  SMP   32        8 x 8 crossbar               1000
SGI Origin 3000        1999  NUMA  512       Fat tree                     500
Compaq AlphaServer GS  1999  SMP   32        Switched bus                 400
Sun V880               2002  SMP   8         Switched bus                 240
HP Superdome 9000      2003  SMP   64        Switched bus                 275
NASA Columbia          2004  NUMA  10240     Fat tree                     ???

MultiIntro.15 Single Bus (Shared Address UMA) Multi’s
Caches are used to reduce latency and to lower bus traffic
Must provide hardware to ensure that caches and memory are consistent (cache coherency)
Must provide a hardware mechanism to support process synchronization
[Figure: three Processors, each with its own Cache, attached to a Single Bus shared with Memory and I/O]

MultiIntro.16 Summing 100,000 Numbers on 100 Processors
Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable)
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];
The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100 (the number of processors))
    repeat
        synch();    /*synchronize first
        if
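The preview cuts off partway through the coordination step, so the slide's own reduction code is not available here. As a rough illustration of the idea only, the following C sketch simulates the 100 processors one after another in a single thread and combines the partial sums by repeatedly halving the number of active processors. The function name `parallel_sum_sketch`, the macros `NPROC` and `N`, and the handling of an odd `half` are assumptions made for this example, not taken from the slide. On real hardware each round would begin with a barrier such as the slide's `synch()`.

```c
#include <assert.h>

#define NPROC 100      /* number of processors, as on the slide */
#define N     100000   /* number of elements, as on the slide */

/* Hypothetical single-threaded simulation of the two phases.
   Each value of Pn stands in for one processor. */
long long parallel_sum_sketch(const int *A)
{
    long long sum[NPROC];   /* shared array of partial sums */
    int half = NPROC;       /* private on each processor in a real run */

    assert(A != 0);

    /* Phase 1: every processor sums its own block of 1000 elements. */
    for (int Pn = 0; Pn < NPROC; Pn++) {
        sum[Pn] = 0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i = i + 1)
            sum[Pn] = sum[Pn] + A[i];
    }

    /* Phase 2: tree reduction. A real multiprocessor would call a
       barrier here at the top of every round. */
    while (half > 1) {
        if (half % 2 != 0)                 /* odd count: processor 0 */
            sum[0] = sum[0] + sum[half - 1]; /* absorbs the last element */
        half = half / 2;
        for (int Pn = 0; Pn < half; Pn++)  /* lower half adds upper half */
            sum[Pn] = sum[Pn] + sum[Pn + half];
    }
    return sum[0];
}
```

With 100 partial sums the active count goes 100, 50, 25, 12, 6, 3, 1, so the final total lands in `sum[0]` after six rounds. Within a round, each simulated processor reads only an element from the upper half and writes only its own element in the lower half, which is why the rounds can run in parallel once a barrier separates them.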
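The Amdahl's Law figures on the earlier slides (1.31, 1.17, and the 0.01% scalar limit) can be reproduced with a few lines of C. This is a minimal sketch: the function names `amdahl_speedup` and `required_fraction` are made up for this example, and the second function simply solves the slide's formula Speedup w/ E = 1 / ((1-F) + F/S) for F.

```c
#include <assert.h>

/* Speedup when a fraction f of the task is accelerated by a factor s
   and the remaining (1 - f) is unaffected. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Fraction f that must be accelerated by factor s to reach a target
   speedup; rearranged from target = 1 / ((1 - f) + f / s). */
double required_fraction(double target, double s)
{
    assert(target >= 1.0 && s > 1.0);
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s);
}
```

Calling `amdahl_speedup(0.25, 20.0)` and `amdahl_speedup(0.15, 20.0)` gives values that round to the slide's 1.31 and 1.17. For a target speedup of 99 on 100 processors, `1.0 - required_fraction(99.0, 100.0)` comes out near 0.0001, which is the roughly 0.01% scalar fraction quoted on the slide.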
