CMSC 611: Advanced Computer Architecture
Parallel Systems

Parallel Computers
Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Almasi and Gottlieb, Highly Parallel Computing, 1989
Parallel machines are expected to play a bigger role in the future because:
• Microprocessors are likely to remain dominant in the uniprocessor arena, and the logical way to extend performance is to connect multiple microprocessors
• Microprocessor technology is not expected to sustain its pace of performance improvement, given the increased level of complexity
• There has been steady progress in software development for parallel architectures in recent years
* Slide courtesy of Dave Patterson

Questions about parallel computers:
• How large a collection?
• How powerful are the processing elements?
• How do they cooperate and communicate?
• How are data transmitted? What type of interconnection?
• What are the HW and SW primitives for programmers?
• Does it translate into performance?

Levels of Parallelism
Bit-level parallelism
• ALU parallelism: 1-bit, 4-bit, 8-bit, ...
Instruction-level parallelism (ILP)
• Pipelining, superscalar, VLIW, out-of-order execution
Process/thread-level parallelism
• Divide a job into parallel tasks
Job-level parallelism
• Independent jobs on one computer system

Applications
Scientific computing
• Nearly unlimited demand (Grand Challenge problems)
• Successes in some real industries:
– Petroleum: reservoir modeling
– Automotive: crash simulation, drag analysis, engine design
– Aeronautics: airflow analysis, engine design, structural mechanics
– Pharmaceuticals: molecular modeling
* Slide courtesy of Dave Patterson

  App                     Perf (GFLOPS)   Memory (GB)
  48-hour weather         0.1             0.1
  72-hour weather         3               1
  Pharmaceutical design   100             10
  Global Change, Genome   1000            1000

Commercial Applications
• Transaction processing
• File servers
• Electronic CAD simulation
• Large WWW servers
• WWW search engines
• Graphics
– Graphics hardware
– Render farms

Framework
Extend traditional computer architecture with a communication architecture:
• abstractions (HW/SW interface)
• organizational structure to realize the abstraction efficiently
Programming model:
• Multiprogramming: lots of jobs, no communication
• Shared address space: communicate via memory
• Message passing: send and receive messages
• Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
Communication abstraction:
• Shared address space: e.g., load, store, atomic swap
• Message passing: e.g., send and receive library calls
• Debate over which is better (ease of programming vs. scaling) → many hardware designs map 1:1 to a programming model

Taxonomy of Parallel Architecture
Flynn categories:
• SISD (Single Instruction, Single Data)
• MISD (Multiple Instruction, Single Data)
• SIMD (Single Instruction, Multiple Data)
• MIMD (Multiple Instruction, Multiple Data)
* Slide courtesy of Dave Patterson

SISD
• Uniprocessor

MISD
• No commercial examples

SIMD
• Apply the same operations to a set of data
– Find primes
– Crack passwords
• Vector/array computers

SIMD Arrays
[Figure: a controller with program and data memory driving a 4x4 array of PEs, each with its own local data]
Performance keys:
• Utilization
• Communication

Data Parallel Model
• Operations are performed in parallel on each element of a large regular data structure, such as an array
• One control processor broadcasts to many processing elements (PEs), with a condition flag per PE so that individual PEs can skip an operation
• For distributed-memory architectures, data is distributed among the memories
• The data parallel model requires fast global synchronization
• Data parallel programming languages lay out data across processors
• Vector processors have similar ISAs, but no data placement restriction
* Slide courtesy of Dave Patterson
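
The short C program below is not from the original slides; it is a rough emulation of the condition-flag-per-PE idea on a 16-PE array, reusing the per-PE f values and the f < .5 test from the SIMD Utilization example that follows. Each array slot stands in for one PE, the loop body plays the role of the controller's broadcast, and the enable flag decides which PEs actually apply the (placeholder) update f = f + 1.

    #include <stdio.h>

    #define NUM_PE 16

    int main(void) {
        /* Per-PE local data and per-PE enable flag (the "condition flag"). */
        float f[NUM_PE] = { 1, 2, 1.5f, 0, 3, -1, 1, 0,
                            2.5f, 2, 0.2f, -3, 0, -1, -6, 0 };
        int enable[NUM_PE];

        /* The controller broadcasts "if (f < .5) f = f + 1": every PE first
           evaluates the test to set its enable flag, then the same update is
           broadcast and applied only where the flag is set. */
        for (int pe = 0; pe < NUM_PE; pe++)
            enable[pe] = (f[pe] < 0.5f);
        for (int pe = 0; pe < NUM_PE; pe++)
            if (enable[pe])
                f[pe] = f[pe] + 1.0f;   /* placeholder operation */

        for (int pe = 0; pe < NUM_PE; pe++)
            printf("PE %2d: f = %g\n", pe, (double)f[pe]);
        return 0;
    }

On a real SIMD array every PE would see the broadcast instruction in the same cycle; disabled PEs simply sit idle, which is why utilization is listed above as a performance key.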
SIMD Utilization
Conditional execution:
• PE enable
– if (f < .5) {...}
• Global PE-enable check
– while (t > 0) {...}
[Figure: 4x4 PE array with per-PE values of f (1, 2, 1.5, 0, 3, -1, 1, 0, 2.5, 2, .2, -3, 0, -1, -6, 0); only PEs whose f satisfies the condition stay enabled]

Communication: MasPar MP-1
• Fast local X-net
• Slow global routing

Communication: CM-2
• Hypercube local routing
• Wormhole global routing

Communication: PixelFlow
• Dense connections within a block
– A single swizzle operation collects one word from each PE in the block
– Designed for antialiasing
• NO inter-block connections
• NO global routing

MIMD
• Message passing
• Shared memory / distributed memory
– Uniform Memory Access (UMA)
– Non-Uniform Memory Access (NUMA)
• Can support either SW model on either HW basis

Message Passing
• Processors have private memories and communicate via messages
Advantages:
• Less hardware, easier to design
• Focuses attention on costly non-local operations

Message Passing Model
• Each PE has a local processor, local data, and (I/O)
– Explicit I/O is used to communicate with other PEs
– Essentially NUMA, but integrated at the I/O level rather than the memory system
• Processes run freely between a Send and the matching Receive
• Send + Receive = synchronization between processes (event model); see the MPI sketch at the end of these notes
– Send: local buffer, remote receiving process/port
– Receive: remote sending process/port, local buffer

History of Message Passing
• Early machines: local communication, blocking send and receive
• Later: DMA with non-blocking sends
– DMA receives into a buffer until the processor issues the receive, and then the data is transferred to local memory
• Later still: SW libraries to allow arbitrary communication

Example: IBM SP-2
• RS/6000 workstations in racks
• Network interface card has an Intel 960
• 8x8 crossbar switch as the communication building block
• 40 MByte/sec per link

Shared Memory
• Processors communicate through a shared address space
• Easy on small-scale machines
Advantages:
• Model of choice for uniprocessors and small-scale multiprocessors
• Ease of programming
• Lower latency
• Easier to use hardware-controlled caching
Drawback:
• Difficult to handle node failure

Centralized Shared Memory
• Processors share a single centralized (UMA) memory through a bus interconnect
• Feasible only for small processor counts, to limit memory contention
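
The following sketch is likewise not from the slides; it illustrates the shared-address-space abstraction (load, store, atomic swap) from the Framework slide on a small-scale shared-memory machine, using POSIX threads and C11 atomics. The spinlock built from atomic_exchange plays the role of the atomic-swap primitive, while the shared counter is accessed with ordinary loads and stores.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Both threads see the same counter and lock: one shared address space. */
    static long counter = 0;
    static atomic_int lock = 0;

    /* A tiny spinlock built from an atomic swap (atomic_exchange), the
       primitive the Framework slide lists alongside load and store. */
    static void spin_lock(atomic_int *l)   { while (atomic_exchange(l, 1)) { } }
    static void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock(&lock);   /* atomic swap */
            counter++;          /* ordinary load and store on shared memory */
            spin_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }

Compile with something like cc -std=c11 -pthread counter.c. Both threads update the same counter simply because they share one address space, which is the ease-of-programming advantage listed on the Shared Memory slide; no explicit data transfer is ever written.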

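For contrast with the shared-memory sketch, here is a minimal message-passing version of the Send/Receive pattern from the Message Passing Model slide, written against MPI and assuming an installation such as MPICH or Open MPI; the transferred value 42, the tag 0, and the file name are arbitrary choices for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* local send buffer */
            /* Send: local buffer to the remote receiving process (rank 1). */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive: blocks until the matching Send arrives, then fills the
               local buffer; the matched pair acts as a synchronization point. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpicc msg.c && mpirun -np 2 ./a.out. Rank 1 blocks inside MPI_Recv until rank 0's MPI_Send arrives, so the pair transfers the data and synchronizes the two processes at once, exactly the event model the slide describes.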
