Multiprocessors (or Parallel Computers) (or Parallel Processors)

Parallel Computers
• Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” Almasi and Gottlieb, Highly Parallel Computing, 1989
• Questions about parallel computers:
  – How large a collection?
  – How powerful are processing elements?
  – How do they cooperate and communicate?
  – How are data transmitted?
  – What type of interconnection?
  – What are HW and SW primitives for programmer?
  – Does it translate into performance?

Parallel Processors “Religion”
• The dream of computer architects since 1960: replicate processors to add performance vs. design a faster processor
• Led to innovative organization tied to particular programming models since “uniprocessors can’t keep going”
  – e.g., uniprocessors must stop getting faster due to limit of speed of light: 1972, … , 1989
  – Borders religious fervor: you must believe!
  – Fervor damped some when 1990s companies went out of business: Thinking Machines, Kendall Square, ...
• Argument instead is the “pull” of opportunity of scalable performance, not the “push” of uniprocessor performance plateau

Opportunities: Scientific Computing
• Nearly Unlimited Demand (Grand Challenge):

  App                     Perf (GFLOPS)   Memory (GB)
  48 hour weather         0.1             0.1
  72 hour weather         3               1
  Pharmaceutical design   100             10
  Global Change, Genome   1000            1000

• Successes in some real industries:
  – Petroleum: reservoir modeling
  – Automotive: crash simulation, drag analysis, engine
  – Aeronautics: airflow analysis, engine, structural mechanics
  – Pharmaceuticals: molecular modeling
  – Entertainment: full length movies (“Toy Story”)

Example: Scientific Computing
• Molecular Dynamics on Intel Paragon with 128 processors (1994)
• Improve over time: load balancing, other
• 128 processor Intel Paragon = 406 MFLOPS
• C90 vector = 145 MFLOPS (or ~45 Intel processors)

Opportunities: Commercial Computing
• Transaction processing & TPC-C benchmark
  – small scale parallel processors to large scale
• Others: File servers, electronic CAD simulation (multiple processes), WWW search engines

What level Parallelism?
• Bit level parallelism: 1970 to ~1985
  – 4 bits, 8 bit, 16 bit, 32 bit microprocessors
• Instruction level parallelism (ILP): 1985 through today
  – Pipelining
  – Superscalar
  – VLIW
  – Out-of-Order execution
  – Limits to benefits of ILP?
• Process Level or Thread level parallelism; mainstream for general purpose computing
  – Servers are parallel
  – High end Desktop dual processor PC

Flynn’s Taxonomy of Parallel Architectures
• Single Instruction Single Data (SISD)
  – uniprocessor
• Single Instruction Multiple Data (SIMD)
  – Illiac-IV, CM-2
• Multiple Instruction Single Data (MISD)
• Multiple Instruction Multiple Data (MIMD)

MIMD advantages
• MIMDs are more flexible
  – Function as single and/or multiprogrammed machine
• MIMDs are cost effective
  – Off-the-shelf microprocessors

Data Parallel Model (SIMD)
• Operations can be performed in parallel on each element of a large regular data structure, such as an array
• 1 Control Processor broadcast to many PEs
  – Condition flag per PE so that can skip
• Data distributed in each memory
• Data parallel programming languages lay out data to processor

Data Parallel Model (SIMD)
• SIMD led to Data Parallel Programming languages
• Advancing VLSI led to single chip FPUs and whole fast µProcs (SIMD less attractive)
• SIMD programming model led to Single Program Multiple Data (SPMD) model
  – All processors execute identical program
• Data parallel programming languages still useful, do communication all at once: “Bulk Synchronous” phases in which all communicate after a global barrier

Small-Scale MIMD
• Memory: centralized with uniform access time (“UMA”) and bus interconnect
• Examples: SPARCCenter, Challenge

Large-Scale MIMD
• Memory: distributed with nonuniform access time (“NUMA”) and scalable interconnect (distributed memory)
• Examples: T3D, Exemplar, Paragon, CM-5

Communication Models for NUMA
• Distributed Shared Memory (DSM)
  – Processors communicate with shared address space
  – Easy on small-scale machines
  – Advantages:
    » Model of choice for uniprocessors, small-scale MPs
    » Ease of programming
• Message passing
  – Processors have private memories, separate address space
  – communicate via messages
  – Advantages:
    » Less hardware, easier to design
    » Focuses attention on costly non-local operations

Shared Address Model
• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Uses virtual memory to map virtual to local or remote physical
• Memory hierarchy model applies: now communication moves data to local proc. cache (as load moves data from memory to cache)

Message Passing Model
• Whole computer (CPU, memory, I/O devices) communicate as explicit I/O operations
  – Essentially NUMA but integrated at I/O devices vs. memory system
• Send specifies local buffer + receiving process on remote computer
• Receive specifies sending process on remote computer + local buffer to place data
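
The send/receive pairing in the Message Passing Model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the machines named in these slides implement it: the `Node` class, the `world` registry, and `run_demo` are hypothetical names, threads in one process stand in for separate computers, and copying the buffer on send stands in for the private, separate address spaces.

```python
import queue
import threading


class Node:
    """Hypothetical stand-in for one whole computer with private memory."""

    def __init__(self, rank, world):
        self.rank = rank
        self.world = world       # assumed registry: rank -> Node
        self.mailboxes = {}      # sender rank -> queue of pending messages
        self.lock = threading.Lock()
        world[rank] = self

    def _mailbox(self, sender):
        # One mailbox per sender, created on first use.
        with self.lock:
            return self.mailboxes.setdefault(sender, queue.Queue())

    def send(self, local_buffer, dest):
        # Send names a local buffer and the receiving process.
        # The data is copied, so the two nodes never share memory.
        self.world[dest]._mailbox(self.rank).put(list(local_buffer))

    def recv(self, source, local_buffer):
        # Receive names the sending process and a local buffer to fill.
        # Assumes the buffer is at least as long as the message.
        data = self._mailbox(source).get(timeout=5)
        local_buffer[:len(data)] = data
        return len(data)


def run_demo():
    world = {}
    a, b = Node(0, world), Node(1, world)
    result = [0, 0, 0]

    def worker():
        # Node 1 receives from node 0, doubles each value, sends it back.
        inbox = [0, 0, 0]
        b.recv(source=0, local_buffer=inbox)
        b.send([x * 2 for x in inbox], dest=0)

    t = threading.Thread(target=worker)
    t.start()
    a.send([1, 2, 3], dest=1)
    a.recv(source=1, local_buffer=result)
    t.join()
    return result


if __name__ == "__main__":
    print(run_demo())
```

Every transfer here is an explicit call that names the other party, in contrast to the Shared Address Model, where communication happens implicitly through ordinary loads and stores.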