Berkeley COMPSCI 258 - The DASH Prototype: Implementation and Performance - D1093255

School name University of California, Berkeley

Course Compsci 258- Parallel Processors

The DASH Prototype: Implementation and Performance

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta and John Hennessy
Computer Systems Laboratory
Stanford University, CA 94305

Abstract

The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design trade-offs, prototypes are essential to ensure that no critical details are overlooked. A prototype provides convincing evidence of the feasibility of the design, allows one to accurately estimate both the hardware and the complexity cost of various features, and provides a platform for studying real workloads. A 16-processor prototype of the DASH multiprocessor has been operational for the last six months. In this paper, the hardware overhead of directory-based cache coherence in the prototype is examined. We also discuss the performance of the system, and the speedups obtained by parallel applications running on the prototype. Using a sophisticated hardware performance monitor, we characterize the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup.

1.0 Introduction

For parallel architectures to achieve widespread usage it is important that they efficiently run a wide variety of applications without excessive programming difficulty. To maximize both high performance and wide applicability, we believe a parallel architecture should provide (i) the ability to support hundreds to thousands of processors, (ii) high-performance individual processors, and (iii) a single shared address space.

One important question that arises in the design of such large-scale single-address-space machines is whether or not to allow caching of shared writeable data.
The advantage, of course, is that caching allows higher performance to be achieved by reducing memory latency; the disadvantage is the problem of cache coherence. While solutions to the cache coherence problem are well understood for small-scale multiprocessors, they are unfortunately not so clear for large-scale machines. In fact, large-scale machines currently do not support cache coherence, and it has not been clear what the benefits and costs will be.

For the past several years, the DASH (Directory Architecture for SHared memory) project has been exploring the feasibility of building large-scale single-address-space machines with coherent caches. The key ideas are to distribute the main memory among the processing nodes to provide scalable memory bandwidth, and to use a distributed directory-based protocol to support cache coherence. To evaluate these ideas, we have constructed a prototype DASH machine. The full prototype will consist of sixty-four 33MHz MIPS R3000/R3010 processors, delivering up to 1600 MIPS and 600 scalar MFLOPS. An initial 16-processor prototype has been working for the past several months, and we are currently expanding this to the full 64-processor configuration.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1992 ACM 0-89791-509-7/92/0005/0092 $1.50

This paper examines the hardware cost and performance characteristics of the prototype DASH system. Cost is measured in terms of the logic gates and the bytes of dynamic and static memory in the base system and the added directory logic.
Performance is measured in terms of memory system bandwidth and latency, and in terms of parallel application speedups. For a representative set of the measured applications, we also present detailed reference statistics and relate these statistics to the observed application speedups. Finally, we describe the structure of the performance monitor logic which was used to take the detailed reference measurements.

The paper is organized as follows. Section 2 gives an overview of the DASH architecture. Section 3 introduces the DASH prototype and describes the logic used for the directory-based coherence protocol. Section 4 details the hardware costs of the system. Section 5 outlines the structure and function of the performance monitor logic, and Section 6 presents the performance of the memory system, and the speedups obtained by parallel applications running on the prototype. We conclude in Section 7 with a summary of our experience with the DASH prototype.

2.0 The DASH Architecture

The DASH architecture has a two-level structure shown in Figure 1. At the top level, the architecture consists of a set of processing nodes (clusters) connected through a mesh interconnection network. In turn, each processing node is a bus-based multiprocessor. Intra-cluster cache coherence is implemented using a snoopy bus-based protocol, while inter-cluster coherence is maintained through a distributed directory-based protocol.

The cluster functions as a high-performance processing node. In addition, the grouping of multiple processors on a bus within each cluster amortizes the cost of the directory logic and the network interface. This grouping also reduces the directory memory requirements by keeping track of cached lines at a cluster as opposed to processor level. (We will more concretely discuss the role of clustering in reducing overhead in Section 4.)

Figure 1. Block diagram of a 2x2 DASH prototype.
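The directory-memory saving from tracking cached lines per cluster rather than per processor can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes a full bit-vector directory (one presence bit per tracked sharer plus one dirty bit per memory block) and a 16-byte memory block, and it takes four processors per cluster from the 16-processor 2x2 prototype. The function names are hypothetical.

```python
def directory_bits_per_block(num_sharers: int, state_bits: int = 1) -> int:
    """Directory bits kept for one memory block.

    Assumes a full bit vector: one presence bit per tracked sharer,
    plus `state_bits` bits of state (here a single dirty bit).
    """
    return num_sharers + state_bits


def directory_overhead(num_sharers: int, block_bytes: int = 16) -> float:
    """Directory bits as a fraction of the data bits in one block.

    The 16-byte default block size is an assumption of this sketch.
    """
    return directory_bits_per_block(num_sharers) / (block_bytes * 8)


PROCESSORS = 64              # full prototype size stated in the text
PROCESSORS_PER_CLUSTER = 4   # 16 processors in a 2x2 mesh of clusters
CLUSTERS = PROCESSORS // PROCESSORS_PER_CLUSTER

per_processor = directory_overhead(PROCESSORS)  # track every processor
per_cluster = directory_overhead(CLUSTERS)      # track clusters only

print(f"per-processor tracking: {per_processor:.1%} of data memory")
print(f"per-cluster tracking:   {per_cluster:.1%} of data memory")
```

Under these assumptions, tracking clusters needs 17 directory bits per block instead of 65, which is the kind of reduction the clustering argument relies on.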
The directory-based protocol implements an invalidation-based coherence scheme. A memory location may be in one of three states: uncached, that is, not cached by any processing node at all; shared, that is, in an unmodified state in the caches of one or more nodes; or dirty, that is, modified in the cache of some
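The three directory states named in this passage can be sketched as a small state machine. This is a simplified illustration, not the DASH protocol itself: the class and method names are hypothetical, all network messages and races are ignored, and it assumes that a read of a dirty block leaves the previous owner holding a shared copy.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = "uncached"  # not cached by any node
    SHARED = "shared"      # unmodified copies in one or more nodes
    DIRTY = "dirty"        # modified copy in exactly one node


@dataclass
class DirectoryEntry:
    """Directory state for one memory block (hypothetical sketch)."""
    state: State = State.UNCACHED
    sharers: set = field(default_factory=set)

    def read(self, cluster: int) -> None:
        """A cluster obtains a read-only copy of the block."""
        # Assumption: if the block was dirty, the owner writes it back
        # and keeps a shared copy, so it stays in `sharers`.
        self.sharers.add(cluster)
        self.state = State.SHARED

    def write(self, cluster: int) -> set:
        """A cluster obtains exclusive ownership.

        Returns the clusters whose copies must be invalidated, which
        is what makes the scheme invalidation-based.
        """
        invalidated = self.sharers - {cluster}
        self.sharers = {cluster}
        self.state = State.DIRTY
        return invalidated


entry = DirectoryEntry()
entry.read(0)
entry.read(1)
to_invalidate = entry.write(2)  # clusters 0 and 1 lose their copies
```

After the write, the entry is dirty with cluster 2 as the only holder, and the returned set lists the copies the directory would have to invalidate.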
