Contents:
- Lecture 18: Introduction to Multiprocessors
- Why Multiprocessors?
- Exploiting (Program) Parallelism
- Exploiting (Program) Parallelism - 2
- Need for Parallel Computing
- What to do with a billion transistors?
- Elements of a multiprocessing system
- Use, Granularity
- Topology
- Coupling
- Control/Data
- Task allocation and routing
- Reconfiguration
- Programmer's model
- Parallel Programming Models
- Message Passing Multicomputers
- Shared-Memory Multiprocessors
- Cache Coherence - A Quick Overview
- Implementation issues
- Performance objectives
- Flynn's Taxonomy of Multiprocessing
- Examples
- Predominant Approaches
- C62x Pipeline Operation: Pipeline Phases
- Superscalar: PowerPC 604 and Pentium Pro
- IA-64 aka EPIC aka VLIW
- Philips Trimedia Processor
- TMS320C6201 Revision 2
- TMS320C6701 DSP Block Diagram
- TMS320C67x CPU Core
- Single-Chip Multiprocessors (CMP)
- Intel IXP1200 Network Processor
- IXP1200 MicroEngine
- IXP1200 Instruction Set
- UCB: Processor with DRAM (PIM) - IRAM, VIRAM
- IRAM Vision Statement
- Potential Multimedia Architecture
- Revive Vector (= VSIW) Architecture!
- V-IRAM1: 0.18 µm, Fast Logic, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 16 MB
- Tentative VIRAM-1 Floorplan
- Tentative VIRAM-"0.25" Floorplan
- Stanford: Hydra Design
- Mescal Architecture
- Outline
- Architectural Rationale and Motivation
- Slide 46
- Architecture Goals
- Architecture Template
- Range of Architectures
- Slide 50
- Slide 51
- Slide 52
- Slide 53
- Range of Architectures (Future)
- The RAW Architecture
- Slide 56
- RAW Machine Overview
- RAW Tiles
- RAW Tiles (cont.)
- Configurable Hardware in RAW
- Benefits of RAW
- Disadvantages of RAW
- Traditional Operations on RAW
- Compiling for RAW machines
- Compiling for RAW (cont.)
- Structure of RAWCC
- The MAPS System
- Space-Time Scheduler
- Basic Block Orchestrator
- Initial Code Transformation
- Instruction Partitioner
- Global Data Partitioner
- Data and Instruction Placer
- Event Scheduler
- Control Flow
- Performance
- Future Work
- Reconfigurable processors
- SCORE: Stream Computation Organized for Reconfigurable Execution
- Opportunity
- Problem
- Introduce: SCORE
- Viewpoint
- Slide 84
- ...borrows heavily from...
- Enabling Hardware
- BRASS Architecture
- Array Model
- Platform Vision
- Example: SCORE Execution
- Spatial Implementation
- Serial Implementation
- Summary: Elements of a multiprocessing system
- Conclusions

Slide 1: Lecture 18: Introduction to Multiprocessors
Prepared and presented by Kurt Keutzer, with thanks for materials from Kunle Olukotun (Stanford) and David Patterson (UC Berkeley).

Slide 2: Why Multiprocessors?
- Needs
  » Relentless demand for higher performance
    – Servers
    – Networks
  » Commercial desire for product differentiation
- Opportunities
  » Silicon capability
  » Ubiquitous computers

Slide 3: Exploiting (Program) Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions.]

Slide 4: Exploiting (Program) Parallelism - 2
[Figure: the same chart with a bit level of parallelism added.]

Slide 5: Need for Parallel Computing
- Diminishing returns from ILP
  » Limited ILP in programs
  » ILP increasingly expensive to exploit
- Peak performance increases linearly with more processors
  » Amdahl's law applies
- Adding processors is inexpensive
  » But most people add memory also
[Figure: two performance vs. die area plots comparing the configurations P+M, 2P+M, and 2P+2M.]

Slide 6: What to do with a billion transistors?
- Technology changes the cost and performance of computer elements in a non-uniform manner
  » logic and arithmetic are becoming plentiful and cheap
  » wires are becoming slow and scarce
- This changes the tradeoffs between alternative architectures
  » superscalar doesn't scale well
    – global control and data
- So what will the architectures of the future be?
[Figure: scaling trend 1998-2001-2004-2007: 64x the area, 4x the speed, but slower wires — cross-chip delay growing from 1 clk to 3 (10, 16, 20?) clks.]

Slide 7: Elements of a multiprocessing system
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurability
  » Computation
  » Interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
[After E. V. Krishnamurty, Chapter 5]

Slide 8: Use, Granularity
- General purpose - attempting to improve general-purpose computation (e.g. SPEC benchmarks) by means of multiprocessing
- Special purpose - attempting to improve a specific application or class of applications by means of multiprocessing
- Granularity - scope and capability of a processing element (PE):
  » NAND gate
  » ALU with registers
  » Execution unit with local memory
  » RISC R1000 processor

Slide 9: Topology
Topology - method of interconnection of processors:
- Bus
- Full-crossbar switch
- Mesh
- N-cube
- Torus
- Perfect shuffle, m-shuffle
- Cube-connected components
- Fat-trees

Slide 10: Coupling
Relationship of communication among processors:
- Shared clock (pipelined)
- Shared registers (VLIW)
- Shared memory (SMM)
- Shared network

Slide 11: Control/Data
Way in which data and control are organized:
- Control - how the instruction stream is managed (e.g. sequential instruction fetch)
- Data - how the data is accessed (e.g. numbered memory addresses)
- Multithreaded control flow - explicit constructs (fork, join, wait) control program flow; central controller
- Dataflow model - instructions execute as soon as operands are ready; the program structures the flow of data; decentralized control

Slide 12: Task allocation and routing
Way in which tasks are scheduled and managed:
- Static - allocation of tasks onto processing elements predetermined before runtime
- Dynamic - hardware/software support allocates tasks to processors at runtime

Slide 13: Reconfiguration
- Computational - restructuring of computational elements
  » reconfigurable - reconfiguration at compile time
  » dynamically reconfigurable - restructuring of computational elements at runtime
- Interconnection scheme
  » switching network - software controlled
  » reconfigurable fabric

Slide 14: Programmer's model
How is parallelism expressed by the user?
- Expressive power
  » Process-level parallelism
    – Shared-memory
    – Message-passing
  » Operator-level parallelism
  » Bit-level parallelism
- Formal guarantees
  » Deadlock-free
  » Livelock-free
- Support for other real-time notions
- Exception handling

Slide 15: Parallel Programming Models
- Message Passing
  » Fork thread
    – Typically one per node
  » Explicit communication
    – Send messages: send(tid, tag, message), receive(tid, tag, message)
  » Synchronization
    – Block on messages (implicit sync)
    – Barriers
- Shared Memory (address space)
  » Fork thread
    – Typically one per node
  » Implicit communication
    – Using shared address space: loads and stores
  » Synchronization
    – Atomic
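Slide 5's note that Amdahl's law applies can be made concrete. If a fraction f of a program's work is inherently serial, the speedup on N processors is bounded by:

```latex
S(N) = \frac{1}{f + \frac{1-f}{N}} \quad\xrightarrow{\;N \to \infty\;}\quad \frac{1}{f}
```

With f = 0.1, for example, even an unlimited number of processors delivers at most a 10x speedup, which is why linearly increasing peak performance rarely translates into linearly increasing delivered performance.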
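Of the topologies listed on slide 9, the n-cube has a particularly compact description: node ids are n-bit numbers, and two nodes are linked iff their ids differ in exactly one bit, so each node has n neighbors and any route takes at most n hops. A small sketch (the function names are ours, not from the slides):

```python
def ncube_neighbors(node, n):
    """Nodes directly linked to `node` in an n-cube: flip each of its n id bits."""
    return [node ^ (1 << bit) for bit in range(n)]

def route_hops(src, dst):
    """Minimum hop count between two n-cube nodes = Hamming distance of their ids."""
    return bin(src ^ dst).count("1")
```

For a 3-cube, node 0 connects to nodes 1, 2, and 4, and the longest route (e.g. 000 to 111) is 3 hops — logarithmic diameter for the cost of log2(P) links per node.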
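The static/dynamic distinction of slide 12 is easiest to see in code: a dynamic scheme is commonly realized as a shared work queue from which idle processors pull the next task at runtime, rather than fixing the assignment before execution. A minimal sketch using Python threads (the worker count and the squaring "task" are arbitrary stand-ins):

```python
# Dynamic task allocation: tasks sit in a shared queue and are claimed
# by whichever worker becomes idle first, instead of being assigned to
# fixed processors before runtime (the static scheme).
import queue
import threading

def run(tasks, n_workers=3):
    work = queue.Queue()
    results = queue.Queue()
    for t in tasks:
        work.put(t)

    def worker():
        while True:
            try:
                item = work.get_nowait()   # claim the next unassigned task
            except queue.Empty:
                return                      # no work left: this worker retires
            results.put(item * item)        # stand-in for real computation

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(results.qsize()))
```

Because assignment happens at runtime, load balances itself: a worker that draws short tasks simply claims more of them, at the cost of contention on the shared queue.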
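The message-passing column of slide 15 can be sketched in a few lines. This is only an illustration, not a real message-passing API: the send/receive/tid/tag names mirror the slide, and threads with private per-node mailboxes stand in for nodes with private memories.

```python
# Message-passing sketch: "nodes" share nothing by convention; all
# communication is explicit, via send/receive on per-node mailboxes.
import queue
import threading

mailboxes = {0: queue.Queue(), 1: queue.Queue()}  # one mailbox per node id (tid)

def send(tid, tag, message):
    mailboxes[tid].put((tag, message))            # explicit communication

def receive(tid):
    return mailboxes[tid].get()                   # blocks until a message arrives
                                                  # (implicit synchronization)
def node1():
    tag, message = receive(1)                     # block on message
    send(0, "result", sum(message))

def run():
    t = threading.Thread(target=node1)            # fork thread, one per node
    t.start()
    send(1, "work", [1, 2, 3])
    tag, result = receive(0)
    t.join()
    return tag, result
```

Note that the blocking receive doubles as the synchronization mechanism, exactly as the slide's "block on messages (implicit sync)" bullet says.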
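The shared-memory column can be sketched the same way: forked threads communicate implicitly through ordinary loads and stores to a shared variable, a lock makes the read-modify-write atomic, and a barrier gives a synchronization point. A minimal sketch using Python's threading module (the thread and iteration counts are arbitrary):

```python
# Shared-memory sketch: communication is implicit (plain loads/stores to
# a shared counter); synchronization is explicit (lock + barrier).
import threading

counter = 0
lock = threading.Lock()
barrier = threading.Barrier(5)      # 4 workers + the main thread

def worker():
    global counter
    for _ in range(1000):
        with lock:                  # atomic increment: lock guards the
            counter += 1            # load-add-store on the shared word
    barrier.wait()                  # synchronization point

def run():
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()                   # fork one thread per "node"
    barrier.wait()                  # wait until every worker is done
    for t in threads:
        t.join()
    return counter
```

Without the lock, the four threads' increments would interleave and updates would be lost — the classic data race that the cache-coherence and synchronization machinery of shared-memory multiprocessors exists to manage.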