Berkeley COMPSCI 258 - Convergence of Parallel Architectures


Convergence of Parallel Architectures
CS 258, Spring 99
David E. Culler
Computer Science Division, U.C. Berkeley

Recap of Lecture 1
• Parallel computer architecture is driven by familiar technological and economic forces
  – the application/platform cycle, but focused on the most demanding applications
  – the hardware/software learning curve
• More attractive than ever because the 'best' building block - the microprocessor - is also the fastest building block
• History of microprocessor architecture
  – translates area and density into performance
• The future is higher levels of parallelism
  – parallel architecture concepts apply at many levels
  – communication is also on an exponential curve
  => quantitative engineering approach
[Figure: the cycle of new applications -> more performance -> speedup]

History
[Figure: application software and system software layered over divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays]
• Parallel architectures were tied closely to programming models
  – divergent architectures, with no predictable pattern of growth
  – mid-80s renaissance

Plan for Today
• Look at the major programming models
  – where did they come from?
  – the 80s architectural renaissance!
  – what do they provide?
  – how have they converged?
• Extract general structure and fundamental design issues
• Reexamine the traditional camps from a new perspective (next week)
[Figure: SIMD, message passing, shared memory, dataflow, and systolic arrays converging toward a generic architecture]

Administrivia
• Mix of HW, exam, and project load
• HW 1 due date moved out to Fri 1/29
  – added problem 1.18
• Hands-on session with the parallel machines

Programming Model
• The conceptualization of the machine that the programmer uses in coding applications
  – how parts cooperate and coordinate their activities
  – specifies communication and synchronization operations
• Multiprogramming
  – no communication or synchronization at the program level
• Shared address space
  – like a bulletin board
• Message passing
  – like letters or phone calls; explicit point-to-point
• Data parallel
  – more regimented: global actions on data
  – implemented with shared address space or message passing

Shared Memory => Shared Address Space
• Bottom-up engineering factors
• Programming concepts
• Why it is attractive
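The shared address space model above is described only in words on the slides. As a concrete illustration, here is a minimal POSIX threads sketch (not from the lecture; the variable names and thread count are arbitrary): threads communicate through ordinary loads and stores to a shared variable and synchronize with a mutex, the same division into "conventional memory operations for communication" and "special atomic operations for synchronization" that a later slide lists.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* The "bulletin board": all threads see the same location. */
static int shared_sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Illustrative worker; the name and the work done are arbitrary. */
static void *worker(void *arg)
{
    int my_value = *(int *)arg;

    pthread_mutex_lock(&lock);     /* synchronization: special atomic operation */
    shared_sum += my_value;        /* communication: an ordinary load and store */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int val[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        val[i] = i + 1;
        pthread_create(&tid[i], NULL, worker, &val[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    /* Ordinary load; the other threads' stores are visible here. */
    printf("shared_sum = %d\n", shared_sum);
    return 0;
}

Compile with cc -pthread. Every thread sees the same shared_sum because all of them run within one address space, which is exactly what the next slides develop bottom-up.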
Adding Processing Capacity
• Memory capacity is increased by adding modules
• I/O capacity by adding controllers and devices
• Add processors for processing!
  – for higher-throughput multiprogramming, or for parallel programs
[Figure: processors, memory modules, and I/O controllers (with their devices) attached to a shared interconnect]

Historical Development
[Figure: "mainframe" organization (processors, memories, and I/O connected by a crossbar) vs. "minicomputer" organization (processors with caches and I/O on a shared bus)]
• "Mainframe" approach
  – motivated by multiprogramming
  – extends the crossbar used for memory and I/O
  – processor cost-limited => crossbar
  – bandwidth scales with p
  – high incremental cost
    » use a multistage network instead
• "Minicomputer" approach
  – almost all microprocessor systems have a bus
  – motivated by multiprogramming and transaction processing
  – used heavily for parallel computing
  – called a symmetric multiprocessor (SMP)
  – latency larger than for a uniprocessor
  – the bus is the bandwidth bottleneck
    » caching is key: the coherence problem
  – low incremental cost

Shared Physical Memory
• Any processor can directly reference any memory location
• Any I/O controller can reach any memory
• The operating system can run on any processor, or on all of them
  – the OS uses shared memory to coordinate
• Communication occurs implicitly as a result of ordinary memory accesses
• What about application processes?

Shared Virtual Address Space
• Process = address space plus thread of control
• The virtual-to-physical mapping can be established so that processes share portions of their address spaces
  – user-kernel, or multiple processes
• Multiple threads of control within one address space
  – a popular approach to structuring operating systems
  – now a standard application capability (e.g. POSIX threads)
• Writes to shared addresses are visible to the other threads
  – natural extension of the uniprocessor model
  – conventional memory operations for communication
  – special atomic operations for synchronization
    » also loads/stores

Structured Shared Address Space
• Ad hoc parallelism is used in system code
• Most parallel applications have a structured shared address space
• Same program runs on each processor
  – the shared variable X means the same thing to each thread
[Figure: virtual address spaces for a collection of processes communicating via shared addresses; the shared portion of each address space maps to common physical addresses, while each process's private portion (P0 private ... Pn private) maps to its own physical memory]

Engineering: Intel Pentium Pro Quad
• All coherence and multiprocessing glue is in the processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
[Figure: P-Pro processor modules (each with CPU, bus interface, MIU, 256-KB L2 cache, interrupt controller) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), plus a memory controller to 1-, 2-, or 4-way interleaved DRAM and PCI bridges to PCI I/O cards]

Engineering: SUN Enterprise
• Processor + memory cards and I/O cards
  – 16 cards of either type
  – all memory is accessed over the bus, so the machine is symmetric
  – higher bandwidth, higher latency bus
[Figure: CPU/mem cards (two processors with L2 caches and a memory controller) and I/O cards (SBUS slots, 2 FiberChannel, 100bT, SCSI) attached through bus interfaces to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

Scaling Up
• The problem is the interconnect: cost (crossbar) or bandwidth (bus)
• Dance-hall organization: bandwidth still scalable, but lower cost than a crossbar
  » latencies to memory are uniform, but uniformly large
• Distributed memory, or non-uniform memory access (NUMA)
  » construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
• Caching shared (particularly nonlocal) data?
[Figure: "dance hall" organization (processor/cache nodes on one side of the network, memory modules on the other) vs. distributed memory (a memory module attached to each processor/cache node)]
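The "Scaling Up" bullet about building a shared address space out of read-request / read-response message transactions can be made concrete with a small sketch. Everything here is assumed for illustration (two simulated nodes, a toy single-hop "network", a made-up split of the global address into home node and offset); the slides describe the idea, not this code.

#include <stdint.h>
#include <stdio.h>

#define NODES 2
#define WORDS 1024                       /* words of memory per node */

/* Each node's local memory; together they form one global address space. */
static uint64_t memory[NODES][WORDS];

/* A message on the "network"; this format is invented for the sketch. */
struct msg { int src; uint64_t addr; uint64_t data; };

/* Invented address split: which node owns an address, and where. */
static int      home_of(uint64_t addr)   { return (int)(addr / WORDS); }
static uint64_t offset_of(uint64_t addr) { return addr % WORDS; }

/* Toy "network": the home node services a read-request and a
 * read-response carries the data back.  On a real machine this is
 * done by the memory controller / network interface, not software. */
static struct msg deliver_read_request(struct msg req)
{
    struct msg resp = req;
    resp.data = memory[home_of(req.addr)][offset_of(req.addr)];
    return resp;
}

/* A load by node `me`: local references go straight to local memory,
 * non-local references become request/response message transactions. */
static uint64_t shared_load(int me, uint64_t addr)
{
    if (home_of(addr) == me)
        return memory[me][offset_of(addr)];

    struct msg req = { me, addr, 0 };
    return deliver_read_request(req).data;
}

int main(void)
{
    memory[1][7] = 42;                   /* word 7 of node 1's memory */
    printf("node 0 loads global address %d -> %llu\n",
           1 * WORDS + 7,
           (unsigned long long)shared_load(0, 1 * WORDS + 7));
    return 0;
}

On a real distributed-memory machine such as the Cray T3E on the next slide, the memory controller and network interface perform this request/response exchange for non-local references.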
Engineering: Cray T3E
• Scales up to 1024 processors, with 480 MB/s links
• The memory controller generates a request message for non-local references
• No hardware mechanism for coherence
  » SGI Origin and similar machines provide this
[Figure: a T3E node: processor with cache, memory controller and network interface, local memory, and a switch with X, Y, Z network links and external I/O]

[Figure: SIMD, message passing, shared memory, dataflow, and systolic arrays converging on a generic architecture: processor/cache nodes, each with local memory, connected by a network]

Message Passing Architectures
• Complete computer as building block, including I/O
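The message passing model itself (explicit point-to-point transfers between private address spaces, "like letters or phone calls" on the earlier programming-model slide) is not shown as code in the preview. The following is a minimal sketch using standard MPI point-to-point calls, one common way to program such machines; it is illustrative, not taken from the lecture.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* explicit send: like mailing a letter to process 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* explicit receive: the matching end of the point-to-point transfer */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run under an MPI launcher with at least two processes (e.g. mpirun -np 2). Each process has its own private copy of value; the only communication is the explicit MPI_Send/MPI_Recv pair.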

