UT EE 382V - The Stanford Dash Multiprocessor

EE382V: Computer Architecture: User System Interplay Lecture #17
Department of Electrical and Computer Engineering
The University of Texas at Austin
Monday, 2 April 2007

Disclaimer: "The contents of this document are scribe notes for The University of Texas at Austin EE382V Spring 2007, Computer Architecture: User System Interplay*. The notes capture the class discussion and may contain erroneous and unverified information and comments."

The Stanford Dash Multiprocessor
Lecture #17: Monday, 26 March 2007
Lecturer: Mattan Erez
Scribe: Carl Olson
Reviewer: Mattan Erez, Min Kyu Jeong

The Computer Systems Laboratory at Stanford University developed a shared-memory multiprocessor called Dash (an abbreviation for Directory Architecture for Shared Memory). The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches [1]. Dash accomplishes this by distributing the directory among clusters of processors connected in a hierarchical bus structure. Although the Dash multiprocessor was a research investigation of highly parallel architectures, a working 64-processor prototype was built and tested at Stanford before the paper was published in March 1992.

Reference:
[1] Lenoski et al., "The Stanford Dash Multiprocessor", IEEE Computer, Volume 25, Issue 3 (March 1992).

1 What is the problem being solved?

• Leverage commodity microprocessors to build a scalable high-performance machine with a single address space and coherent caches.

• Provide scalability that achieves linear or near-linear performance from tens to thousands of processors while maintaining the price/performance advantage of an individual processor.

• Investigate parallel architectures without imposing excessive programming difficulty. Do not want to complicate the programming model.

*Copyright 2007 Carl Olson and Mattan Erez, all rights reserved.
This work may be reproduced and redistributed, in whole or in part, without prior written permission, provided all copies cite the original source of the document, including the names of the copyright holders and "The University of Texas at Austin EE382V Spring 2007, Computer Architecture: User System Interplay".

• Current cache-coherent shared-address machines do not scale effectively: snooping schemes cause the common bus and the individual processor caches to saturate.

• Address the problems of data partitioning and dynamic load distribution in parallel machines.

2 Who are the intended users?

• Application programmers porting sequential applications to parallel machines.

• Users of current and future parallel applications.

• General-purpose processing requiring high total system performance.

• Users of message-passing machines, because Dash can emulate message passing.

2.1 Intended Readers

• Operating system developers.

• Writers of automatically parallelizing compilers.

• Computer architects defining scalable parallel architectures.

• Researchers developing interconnection networks. Dash claims to be interconnect-independent, but a point-to-point interconnect that cannot broadcast increases latency and bandwidth demands, making it a poor fit for Dash. The interconnect is not made explicit in the paper, but Dash is not really independent of the interconnect type.

• Parallel-language developers who want to exploit the underlying multicore hardware.

• Researchers interested in evaluation techniques for their parallel architectures.

• Readers interested in a brief overview of cache-coherence and consistency concepts.

3 What is unique about the suggested solution?

• First operational machine to include scalable cache coherence.

• A coherence protocol exploiting partitioned and distributed directories rather than a centralized one.

• Hierarchical (cluster-based) cache and bus model: a combination of point-to-point links and buses.

• Distributed memory to exploit locality.
Private data can be made local to the cluster to avoid long-latency remote references.

• Coherence traffic is directory-based point-to-point messaging. Directories allow invalidations to be sent only to the caches that need them.

• Prefetching in the directory, plus update-write and deliver primitives, for handling blocking read misses. The update-write operation sends the new data directly to all processors that have cached the data, while the deliver operation sends the data to specified clusters [1].

• Synchronization techniques: queue-based locks and fetch-and-increment operations. The queue-based locks in Dash use the directory to indicate which processors are spinning on the lock. Fetch-and-increment operations have low serialization and are useful for implementing several synchronization primitives such as barriers, distributed loops, and work queues [1].

• Creation of a split-transaction bus protocol through masking of remote requests.

• Separate request and reply meshes eliminate request-reply deadlocks.

• Does not guarantee deadlock-free operation, but achieves deadlock avoidance through remote-access-cache buffering and NACKs.

• The shared address space allows users to assume a uniform memory model.

• Supports overlapping multiple write invalidations, thanks to non-blocking writes.

4 How is the idea evaluated?

• The Dash system was both simulated and built. Three applications were used to measure performance.

• Performance-monitoring modules were used.

• To reduce the memory required for the directory, invalidations per shared write are analyzed in just two applications. The results show that bit-vector-based sharer tracking can be replaced with a few pointers plus limited broadcasts to keep track of nodes and their accesses.

• Unloaded read-miss latency (no contention) is shown. What about the number of misses at each level of the hierarchy in the applications?

5 Was the evaluation in line with the stated user requirements?

• Not a very good job of evaluation considering the claims.
Too many claims with limited results (at least the ones shown).

• Would like to see the number of local and remote misses in the applications, not just the duration of one unloaded read miss.

• Would like to see how Dash scales with different interconnection networks, because the claim is that Dash is independent of the interconnect.

• Would like to see how the Dash OS performed on the Dash hardware.

• Does not really show contention in the applications.

• Some general-purpose applications, like MP3D, do not scale well with more processors.

• Evaluation of programming difficulty was not done.

• Comparison of a message
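The directory mechanism from Section 3 (point-to-point invalidations sent only to actual sharers) and the pointer-based alternative analyzed in Section 4 can be sketched in a few lines. This is a hypothetical simulation, not the actual Dash protocol; all names (Directory, max_pointers) are invented for illustration.

```python
class Directory:
    """Hypothetical home-node directory: tracks which clusters share each
    memory block, so a write triggers point-to-point invalidations only to
    those sharers instead of a bus-wide broadcast."""

    def __init__(self, num_clusters, max_pointers=None):
        self.num_clusters = num_clusters
        # With max_pointers set, emulate the limited-pointer scheme from
        # Section 4: overflow falls back to a broadcast to all clusters.
        self.max_pointers = max_pointers
        self.sharers = {}        # block address -> set of sharing cluster ids
        self.overflowed = set()  # blocks forced into broadcast mode

    def read(self, block, cluster):
        """A cluster caches the block; the directory records the sharer."""
        s = self.sharers.setdefault(block, set())
        s.add(cluster)
        if self.max_pointers is not None and len(s) > self.max_pointers:
            self.overflowed.add(block)  # too many sharers to track exactly

    def write(self, block, cluster):
        """Return the set of clusters that must receive invalidations."""
        if block in self.overflowed:
            # Limited-pointer overflow: invalidate everyone but the writer.
            targets = set(range(self.num_clusters)) - {cluster}
        else:
            targets = self.sharers.get(block, set()) - {cluster}
        # After the write, the writer is the sole (exclusive) holder.
        self.sharers[block] = {cluster}
        self.overflowed.discard(block)
        return targets
```

With a full bit vector (max_pointers=None) only the actual sharers are invalidated; with a small pointer count, a widely shared block degrades to a limited broadcast, which the paper's invalidations-per-shared-write data suggests is rare enough to be acceptable.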
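The fetch-and-increment bullet in Section 3 can likewise be illustrated with a barrier sketch. This is only a software analogy, not Dash code: Python threads stand in for processors, and a lock-protected counter stands in for the hardware fetch-and-increment; the class name and structure are invented for illustration.

```python
import threading

class FetchIncBarrier:
    """Sense-reversing barrier built on a fetch-and-increment-style counter,
    analogous to the barrier use of the primitive described in Section 3."""

    def __init__(self, n):
        self.n = n            # number of participating "processors"
        self.count = 0        # arrivals in the current episode
        self.sense = False    # flips each time the barrier opens
        self.cv = threading.Condition()  # stands in for atomic fetch-and-inc

    def wait(self):
        with self.cv:
            my_sense = not self.sense
            self.count += 1              # the "fetch-and-increment"
            if self.count == self.n:     # last arriver releases everyone
                self.count = 0
                self.sense = my_sense
                self.cv.notify_all()
            else:
                while self.sense != my_sense:
                    self.cv.wait()
```

The point of the hardware primitive is that arrivals serialize only on the single increment rather than on lock hand-offs, which is what makes it attractive for barriers, distributed loops, and work queues.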

