Review: Multiprocessor Basics
CMP: Multiprocessors On One Chip
Multithreading on A Chip
Types of Multithreading
Multithreaded Example: Sun’s Niagara (UltraSparc T1)
Niagara Integer Pipeline
Simultaneous Multithreading (SMT)
Threading on a 4-way SS Processor Example
Multicore Xbox360: “Xenon” processor
Xenon Diagram
The PS3 “Cell” Processor Architecture
How to make use of the SPEs
What about the Software?

CMP&SMT.1

Review: Multiprocessor Basics

                                          # of Proc
  Communication model   Message passing   8 to 2048
                        Shared address    NUMA  8 to 256
                                          UMA   2 to 64
  Physical connection   Network           8 to 256
                        Bus               2 to 36

- Q1: How do they share data?
- Q2: How do they coordinate?
- Q3: How scalable is the architecture? How many processors?

CMP: Multiprocessors On One Chip

- By placing multiple processors, their memories, and the IN all on one chip, the latencies of chip-to-chip communication are drastically reduced
- ARM multi-chip core
  - Configurable between 1 & 4 symmetric CPUs
  - Configurable # of hardware intr

[Figure: ARM multi-chip core block diagram. Labels: Snoop Control Unit; four CPUs, each with L1$s; Interrupt Distributor; four CPU Interfaces; per-CPU aliased peripherals; private peripheral bus; primary AXI R/W 64-b bus; optional AXI R/W 64-b bus; I & D 64-b bus; CCB; private IRQ]

Multithreading on A Chip

- Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
- Multithreading: increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  - Processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread
  - The caches and buffers can be shared (although the miss rates may increase if they are not sized accordingly)
  - The memory can be shared through virtual memory mechanisms
  - Hardware must support efficient thread context switching

Types of Multithreading

- Fine-grain: switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads)
  - Processor must be able to switch threads on every clock cycle
  - Advantage: can hide throughput losses that come from both short and long stalls
  - Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
- Coarse-grain: switch threads only on costly stalls (e.g., L2 cache misses)
  - Advantages: thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  - Disadvantage: limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    - Pipeline must be flushed and refilled on thread switches

Multithreaded Example: Sun’s Niagara (UltraSparc T1)

- Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)

                    Ultra III                Niagara
  Data width        64-b                     64-b
  Clock rate        1.2 GHz                  1.0 GHz
  Cache (I/D/L2)    32K/64K/(8M external)    16K/8K/3M
  Issue rate        4 issue                  1 issue
  Pipe stages       14 stages                6 stages
  BHT entries       16K x 2-b                None
  TLB entries       128I/512D                64I/64D
  Memory BW         2.4 GB/s                 ~20 GB/s
  Transistors       29 million               200 million
  Power (max)       53 W                     <60 W

[Figure: Niagara block diagram. Eight 4-way MT SPARC pipes connected through a crossbar to a 4-way banked L2$, memory controllers, and I/O shared funct’s]

Niagara Integer Pipeline

- Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient
- Pipeline stages: Fetch, Thrd Sel, Decode, Execute, Memory, WB

[Figure: Niagara integer pipeline. Labels: I$, ITLB, Inst buf x4, PC logic x4, Decode, RegFile x4, Thread Select Logic, two Thrd Sel Muxes, ALU, Mul, Shft, Div, D$, DTLB, Stbuf x4, Crossbar Interface; Thread Select Logic labels: Instr type, Cache misses, Traps & interrupts, Resource conflicts]

Simultaneous Multithreading (SMT)

- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
  - Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., more than the programs have ILP)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    - Need separate rename tables (ROBs) for each thread
    - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
- Intel’s Pentium 4 SMT is called hyperthreading
  - Supports just two threads (doubles the architecture state)

Threading on a 4-way SS Processor Example

[Figure: issue slots (horizontal) versus time (vertical) for Threads A, B, C, and D on a 4-way superscalar processor, comparing Coarse MT, Fine MT, and SMT]

Multicore Xbox360: “Xenon” processor

- To provide game developers with a balanced and powerful platform
- Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  - 165M transistors total
  - 3.2 GHz near-POWER ISA
  - 2-issue, 21 stage pipeline, with 128 128-bit registers
  - Weak branch prediction, supported by software hinting
  - In-order instruction execution
  - Narrow cores: 2 INT units, 2 128-bit VMX units, 1 of anything else
- An ATI-designed 500 MHz GPU w/ 512MB of DDR3 DRAM
  - 337M transistors, 10MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs

Xenon Diagram

[Figure: Xenon system diagram. Labels: Core 0, Core 1, Core 2 (each with L1D and L1I); 1MB UL2; 512MB DRAM; GPU with BIU/IO Intf, 3D Core, 10MB EDRAM, Video Out, MC0, MC1; Analog Chip; XMA Dec; SMC; DVD; HDD Port; Front USBs (2); Wireless; MU ports (2 USBs); Rear USB (1); Ethernet; IR; Audio Out; Flash; Systems Control; Video Out]

The PS3 “Cell” Processor Architecture

- Composed of a non-SMP architecture
  - 234M transistors @ 4 GHz
  - 1 Power Processing Element, 8 “Synergistic” (SIMD) PEs
  - 512KB L2 $; a massively high bandwidth (200GB/s) bus connects it to everything else
- The PPE is strangely similar to one of the Xenon cores
  - Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
- The real differences lie in the SPEs (21M transistors each)
  - An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256KB “scratchpad” (14M transistors)
    - Direct mapped for low latency
  - 4 vector units per SPE, 1 of everything else (7M transistors)

How to make use of the SPEs

What about the Software?

- Makes use of special IBM “Hypervisor”
  - Like an OS for OS’s
  - Runs both a
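
The trade-off between the fine-grain and coarse-grain policies from the Types of Multithreading slide can be made concrete with a toy cycle counter. This is only a sketch under stated assumptions, not a model of any real core: a single-issue pipeline, threads described by a hypothetical list of per-instruction stall lengths, and made-up parameters `long_stall` (what counts as a costly stall) and `refill` (the flush/refill penalty paid on a coarse-grain switch).

```python
from dataclasses import dataclass

@dataclass
class Thread:
    stalls: list       # stalls[i] = cycles the thread is blocked after issuing instruction i
    pc: int = 0        # index of the next instruction to issue
    ready_at: int = 0  # first cycle at which the thread may issue again

    def done(self):
        return self.pc >= len(self.stalls)

def issue(t, cycle):
    """Issue one instruction from thread t and record when it can issue next."""
    t.ready_at = cycle + 1 + t.stalls[t.pc]
    t.pc += 1

def fine_grain(threads):
    """Round-robin on every cycle, skipping stalled threads. Returns cycles used."""
    cycle, last, n = 0, -1, len(threads)
    while not all(t.done() for t in threads):
        for k in range(1, n + 1):
            i = (last + k) % n
            if not threads[i].done() and threads[i].ready_at <= cycle:
                issue(threads[i], cycle)
                last = i
                break
        cycle += 1  # the slot is wasted if no thread was ready
    return cycle

def coarse_grain(threads, long_stall=10, refill=3):
    """Stay on one thread; switch only on a costly stall or when it finishes."""
    cycle, cur, n = 0, 0, len(threads)
    while not all(t.done() for t in threads):
        t = threads[cur]
        if not t.done() and t.ready_at <= cycle:
            issue(t, cycle)
            cycle += 1
            continue
        others = [(cur + k) % n for k in range(1, n)]
        if t.done():
            # current thread finished: move to the next unfinished thread
            cand = [i for i in others if not threads[i].done()]
        elif t.ready_at - cycle >= long_stall:
            # costly stall: switch only to a thread that is ready right now
            cand = [i for i in others
                    if not threads[i].done() and threads[i].ready_at <= cycle]
        else:
            cand = []  # short stall: stay on this thread and idle
        if cand:
            cur = cand[0]
            cycle += refill  # pipeline is flushed and refilled on a switch
        else:
            cycle += 1
    return cycle

def make(workload):
    """Build fresh Thread objects so each policy starts from the same state."""
    return [Thread(list(s)) for s in workload]

short_stalls = [[3, 3, 3]] * 4                # four threads, frequent short stalls
one_long_stall = [[0, 20, 0, 0], [0, 0, 0, 0]]  # one thread takes a long miss

print(fine_grain(make(short_stalls)), coarse_grain(make(short_stalls)))
print(fine_grain(make(one_long_stall)), coarse_grain(make(one_long_stall)))
```

With these hypothetical workloads the sketch reports 12 versus 45 cycles for the short-stall case (fine-grain hides the short stalls; coarse-grain never switches on them and pays the refill penalty between threads) and 25 versus 24 cycles for the long-stall case (coarse-grain overlaps the second thread with the long miss, while fine-grain delays the stalling thread by interleaving).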
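
The issue-slot picture on the Threading on a 4-way SS Processor Example slide can also be approximated numerically. The sketch below is a deliberate simplification with assumed inputs: `ready[t][c]` is a hypothetical count of independent instructions thread `t` could issue in cycle `c`, unissued instructions are not carried over to later cycles, and the fine-grain policy is plain round-robin with no skipping of stalled threads.

```python
WIDTH = 4  # issue slots per cycle on the 4-way superscalar from the slide

def fine_mt_slots(ready, width=WIDTH):
    """One thread owns all slots in a cycle; slots it cannot fill are wasted."""
    n = len(ready)
    cycles = len(ready[0])
    return sum(min(width, ready[c % n][c]) for c in range(cycles))

def smt_slots(ready, width=WIDTH):
    """Instructions from all threads may share the slots of a single cycle."""
    cycles = len(ready[0])
    return sum(min(width, sum(t[c] for t in ready)) for c in range(cycles))

# Hypothetical per-cycle ILP for threads A, B, C, D over four cycles.
ready = [
    [2, 1, 0, 3],  # Thread A
    [1, 2, 2, 0],  # Thread B
    [0, 1, 1, 1],  # Thread C
    [0, 0, 2, 1],  # Thread D
]

total = WIDTH * len(ready[0])
print("fine MT:", fine_mt_slots(ready), "of", total, "slots used")
print("SMT:    ", smt_slots(ready), "of", total, "slots used")
```

For this made-up input, fine-grain multithreading fills 6 of the 16 slots because no single thread has enough ILP to fill a cycle, while SMT fills 15 of 16 by combining instructions from several threads in the same cycle.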