Berkeley COMPSCI 258 - The J-Machine Multicomputer

The J-Machine Multicomputer: An Architectural Evaluation*

Michael D. Noakes, Deborah A. Wallach, and William J. Dally
Artificial Intelligence Laboratory and Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
noakes@ai.mit.edu, kerr@ai.mit.edu, [email protected]

Abstract

The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an integrated multicomputer component, the Message-Driven Processor (MDP), and 1 MByte of DRAM. The MDP provides mechanisms to support efficient communication, synchronization, and naming. A 512 node J-Machine is operational and is due to be expanded to 1024 nodes in March 1993. In this paper we discuss the design of the J-Machine and evaluate the effectiveness of the mechanisms incorporated into the MDP. We measure the performance of the communication and synchronization mechanisms directly and investigate the behavior of four complete applications.

1 Introduction

Over the past 40 years, sequential von Neumann processors have evolved a set of mechanisms appropriate for supporting most sequential programming models. It is clear, however, from efforts to build concurrent machines by connecting many sequential processors, that these highly-evolved sequential mechanisms are not adequate to support most parallel models of computation. These mechanisms do not efficiently support synchronization of threads, communication of data, or global naming of objects. As a result, these functions, inherent to any parallel model of computation, must be implemented largely in software with prohibitive overhead.

The J-Machine project [5] was developed to study how to best apply modern VLSI technology to construct a multicomputer. Each processing node of the J-Machine consists of a Message-Driven Processor (MDP) and 1 MByte of DRAM.
The MDP incorporates a 36-bit integer processor (32 bits of data augmented with 4 bits of tag), a memory management unit, a router for a 3-D mesh network, a network interface, a 4K-word x 36-bit SRAM, and an ECC DRAM controller in a single 1.1M transistor VLSI chip. Rather than being specialized for a single model of computation, the MDP incorporates primitive mechanisms for communication, synchronization, and naming that permit it to efficiently support threads with 50 to 150 instructions which exchange small data objects frequently with low latency and synchronize quickly. A 512 node J-Machine is in daily use at MIT and will be expanded to 1024 nodes in March 1993.

*The research described in this paper was supported in part by the Defense Advanced Research Projects Agency under contracts N00014-88-K-0738 and F19628-92-C-0045, and by a National Science Foundation Presidential Young Investigator Award, grant MIP-8657531, with matching funds from General Electric Corporation, IBM Corporation, and AT&T.

This paper describes a range of experiments performed on the J-Machine to study the effectiveness of the selected mechanisms in supporting parallel applications. These experiments are divided into micro-benchmarks, designed to isolate the effects of the primitive mechanisms, and macro-benchmarks, to demonstrate the cumulative effect of the mechanisms on application level codes. We investigate the sequential performance of the MDP, the message-passing mechanisms, the performance of the 3D-mesh network, and the behavior of parallel applications running on the J-Machine.

We use these studies to critique the effectiveness of the mechanisms and reflect on the impact of these design decisions in developing programming systems for the J-Machine.
We contrast the effectiveness of the J-Machine with comparable multicomputers and consider the impact of alternative mechanisms to further enhance efficiency.

2 The J-Machine

This section describes the architecture of the J-Machine and the hardware prototype on which the studies were performed.

2.1 Architecture

The instruction set of the MDP includes the usual arithmetic, data movement, and control instructions. The MDP is unique in providing special support for communication, synchronization, and naming.

0884-7495/93 $3.00 © 1993 IEEE

The MDP supports communication using a set of send instructions for message formatting, a fast network for delivery, automatic message buffering, and task creation upon message arrival. A series of send instructions is used to inject messages at a rate of up to 2 words per cycle. The format of a message is arbitrary except that the first word must contain the address of the code to run at the destination and the length of the message. Messages are routed through the 3D-mesh network using deterministic, e-cube, wormhole routing [4]. The channel bandwidth is 0.5 words/cycle and the minimum latency is 1 cycle/hop. Upon arrival, messages are buffered in a hardware queue. When a message arrives at the head of the queue, a task is dispatched to handle it in four processor cycles. During these cycles the Instruction Pointer is loaded from the message header, an address register is set to point to the new message so that the thread's arguments may be accessed, and the thread's first instruction is fetched and decoded.

Messages may be issued at one of two priorities. Priority one messages receive preference during channel arbitration, are buffered in a separate queue at the destination, and are dispatched before pending priority zero messages. Priority one threads may interrupt executing priority zero threads. There is also a background priority that runs whenever both message queues are empty.
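
The deterministic e-cube routing mentioned above can be illustrated with a small Python model. This is only a sketch, not MDP code: the function name `ecube_route` is hypothetical, and the x, then y, then z dimension order is an assumption, since the text says only that routing is deterministic e-cube. The model corrects one coordinate at a time, so the path between any two nodes is fixed, and the number of hops gives the minimum header latency at the stated 1 cycle/hop.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # (x, y, z) position of a node in the 3-D mesh


def ecube_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return every node visited from src to dst, fixing one dimension at a time.

    Assumption: dimensions are corrected in the order x, y, z.
    """
    path = [src]
    current = list(src)
    for dim in range(3):
        # Move one hop at a time along this dimension until it matches dst.
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append((current[0], current[1], current[2]))
    return path


route = ecube_route((0, 0, 0), (2, 1, 3))
hops = len(route) - 1
# The text gives a minimum latency of 1 cycle per hop.
min_header_latency_cycles = hops * 1
print(route)
print(hops, min_header_latency_cycles)
```

Because the route depends only on the source and destination coordinates, two messages between the same pair of nodes always follow the same channels.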
Fast interrupt processing is achieved through the use of three distinct register sets.

Synchronization is provided by the ability to signal events effectively using the low-latency communication primitives and by the use of data-tagging in both the register file and memory. Two of the possible sixteen data types, cfut and fut, are used to mark slots for values that have not yet been computed. If a thread attempts to read a slot before the value has been supplied, the processor will trap to a system routine to suspend the thread until the value is delivered. In this event, the arrival of the value is used to restart the thread. The cfut type provides inexpensive synchronization on a single slot, much like a full-empty bit. The fut type may be copied without faulting and thus supports the more flexible, but more expensive, future datatype [2]. Futures are first-class data objects and references to them
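
The cfut behaviour described in this section (a read of an unfilled slot suspends the reader, and the arrival of the value restarts it) can be modelled in a few lines of Python. This is a sketch under stated assumptions: the names `Slot`, `read_slot`, and `write_slot` are hypothetical, a suspended thread is represented by a callback, and the waiter list stands in for whatever bookkeeping the system trap routine actually keeps, which the text does not describe. The copyable fut type is not modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

CFUT = "cfut"  # tag marking a slot whose value has not been computed yet
INT = "int"    # tag for an ordinary, already computed value


@dataclass
class Slot:
    tag: str = CFUT
    value: Optional[int] = None
    # Assumption: suspended readers are kept as callbacks on the slot itself.
    waiters: List[Callable[[int], None]] = field(default_factory=list)


def read_slot(slot: Slot, resume: Callable[[int], None]) -> bool:
    """Try to read a slot; return False if the reader had to be suspended."""
    if slot.tag == CFUT:
        # Models the trap: the reader is parked until the value is delivered.
        slot.waiters.append(resume)
        return False
    resume(slot.value)
    return True


def write_slot(slot: Slot, value: int) -> None:
    """Deliver the value and restart every reader that was waiting on it."""
    slot.tag = INT
    slot.value = value
    waiters, slot.waiters = slot.waiters, []
    for resume in waiters:
        resume(value)


log: List[int] = []
slot = Slot()
read_slot(slot, log.append)   # slot is still cfut, so the reader is suspended
write_slot(slot, 42)          # the arrival of the value restarts the reader
print(log)
```

Keeping the tag next to the value means an ordinary read doubles as the synchronization check, which is the sense in which the text compares cfut to a full-empty bit.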

