Berkeley COMPSCI 258 - The Stanford FLASH Multiprocessor

The Stanford FLASH Multiprocessor

Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy

Computer Systems Laboratory
Stanford University
Stanford, CA 94305

Abstract

The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible (it can support a variety of different communication mechanisms) and simplifies the design and implementation.

This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.

1 Introduction

The two architectural techniques for communicating data among processors in a scalable multiprocessor are message passing and distributed shared memory (DSM). Despite significant differences in how programmers view these two architectural models, the underlying hardware mechanisms used to implement them have been converging. Current DSM and message-passing multiprocessors consist of processing nodes interconnected by a high-bandwidth network. Each node contains a node processor, a portion of the physically distributed memory, and a node controller that connects the processor, memory, and network together. The principal difference between message-passing and DSM machines is in the protocol implemented by the node controller for transferring data both within and among nodes.

Perhaps more surprising than the similarity of the overall structure of these types of machines is the commonality in the functions performed by the node controller. In both cases, the primary performance-critical function of the node controller is the movement of data at high bandwidth and low latency among the processor, memory, and network. In addition to these existing similarities, the architectural trends for both styles of machine favor further convergence in the hardware and software mechanisms used to implement the communication abstractions. Message-passing machines are moving toward efficient support of short messages and a uniform address space, features normally associated with DSM machines. Similarly, DSM machines are starting to provide support for message-like block transfers (e.g., the Cray T3D), a feature normally associated with message-passing machines.

The efficient integration and support of both cache-coherent shared memory and low-overhead user-level message passing is the primary goal of the FLASH (FLexible Architecture for SHared memory) multiprocessor. Efficiency involves both low hardware overhead and high performance.
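To make the shared structure described above concrete, the sketch below shows the skeleton common to both machine styles: a node controller that reduces every processor, network, or I/O transaction to a typed message and dispatches it to a protocol handler, so that only the handler table differs between a cache-coherence protocol and a message-passing protocol. This is an illustrative sketch, not FLASH's or MAGIC's interface; all type and function names (message_t, node_controller_loop, and so on) are invented.

/*
 * Illustrative sketch only: the common structure the Introduction attributes
 * to both DSM and message-passing node controllers. Every transaction from
 * the processor, the network, or I/O becomes a typed message dispatched to a
 * protocol handler; swapping the handler table changes the machine's style.
 * All names here are hypothetical.
 */
#include <stddef.h>
#include <stdint.h>

typedef enum { SRC_PROCESSOR, SRC_NETWORK, SRC_IO } msg_source_t;

typedef struct {
    msg_source_t source;   /* port the transaction arrived on                */
    uint16_t     type;     /* protocol-defined message type (e.g., GET, PUT) */
    uint64_t     address;  /* global physical address, if any                */
    void        *payload;  /* data staged by the hardwired data path         */
    size_t       length;
} message_t;

/* A protocol is simply a table of handlers indexed by message type. */
typedef void (*handler_fn)(message_t *msg);

void node_controller_loop(handler_fn protocol[], size_t num_types,
                          message_t *(*next_message)(void))
{
    for (;;) {
        message_t *msg = next_message();   /* pull from processor/network/I/O queues */
        if (msg->type < num_types && protocol[msg->type] != NULL)
            protocol[msg->type](msg);      /* coherence or message-passing handler   */
    }
}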
A major problem of current cache-coherent DSM machines (such as the earlier DASH machine [LLG+92]) is their high hardware overhead, while a major criticism of current message-passing machines is their high software overhead for user-level message passing. FLASH integrates and streamlines the hardware primitives needed to provide low-cost and high-performance support for global cache coherence and message passing. We aim to achieve this support without compromising the protection model or the ability of an operating system to control resource usage. The latter point is important since we want FLASH to operate well in a general-purpose multiprogrammed environment with many users sharing the machine, as well as in a traditional supercomputer environment.

To accomplish these goals we are designing a custom node controller. This controller, called MAGIC (Memory And General Interconnect Controller), is a highly integrated chip that implements all data transfers both within the node and between the node and the network. To deliver high performance, the MAGIC chip contains a specialized data path optimized to move data between the memory, network, processor, and I/O ports in a pipelined fashion without redundant copying. To provide the flexible control needed to support a variety of DSM and message-passing protocols, the MAGIC chip contains an embedded processor that controls the data path and implements the protocol. The separate data path allows the processor to update the protocol data structures (e.g., the directory for cache coherence) in parallel with the associated data transfers (a sketch of this arrangement appears below).

This paper describes the FLASH design and rationale. Section 2 gives an overview of FLASH. Section 3 briefly describes two example protocols, one for cache-coherent shared memory and one for message passing. Section 4 presents the microarchitecture of the MAGIC chip. Section 5 briefly presents our system software strategy, and Section 6 presents our implementation strategy and current status. Section 7 discusses related work, and we conclude in Section 8.

2 FLASH Architecture Overview

FLASH is a single-address-space machine consisting of a large number of processing nodes connected by a low-latency, high-bandwidth interconnection network. Every node is identical (see Figure 2.1), containing a high-performance off-the-shelf microprocessor with its caches, a portion of the machine's distributed main memory, and the MAGIC node controller chip. The MAGIC chip forms the heart of the node, integrating the memory controller, I/O controller, network interface, and a programmable protocol processor. This integration allows for low hardware overhead while supporting both cache-coherence and message-passing protocols in a scalable and cohesive fashion.

Figure 2.1. FLASH system architecture.

The MAGIC architecture is designed to offer both flexibility and high performance. First, MAGIC includes a programmable protocol processor for flexibility. Second, MAGIC's central location within the node ensures that it sees all processor, network, and I/O transactions, allowing it to control all node resources and support a variety
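As a rough illustration of the division between MAGIC's hardwired data path and its protocol processor, the sketch below services a read request for a locally homed, clean cache line: the handler first directs the data path to send the line to the requester and only then updates the directory entry, so protocol bookkeeping overlaps the data transfer. It is a minimal sketch under assumed data structures; dir_entry_t, datapath_send_line, and handle_remote_read are hypothetical names, not MAGIC's actual interface.

/*
 * Minimal sketch, assuming a bit-vector directory; the structure, field, and
 * function names are invented for illustration. The point is the division of
 * labor described above: the handler launches the hardwired data-path
 * transfer first, then updates the directory while that transfer proceeds.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t sharers;   /* one bit per node that may cache the line (<= 64 nodes assumed) */
    bool     dirty;     /* true if exactly one node holds an exclusive copy               */
} dir_entry_t;

/* Assumed data-path primitive: start a memory-to-network transfer of one cache
 * line and return immediately; the pipelined data path completes it on its own. */
void datapath_send_line(uint64_t addr, int dest_node);

/* Handle a read request from a remote node for a locally homed, clean line. */
void handle_remote_read(dir_entry_t *dir, uint64_t addr, int requester)
{
    /* 1. Kick off the data movement; the line never passes through the
     *    protocol processor itself. */
    datapath_send_line(addr, requester);

    /* 2. Update protocol state while the transfer above is still in flight. */
    dir->sharers |= (1ULL << (unsigned)requester);

    /* The dirty/exclusive cases would invoke other handlers; omitted here. */
}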

