Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network

Robert Stets, Sandhya Dwarkadas, Nikolaos Hardavellas, Galen Hunt, Leonidas Kontothanassis†, Srinivasan Parthasarathy, and Michael Scott

Department of Computer Science, University of Rochester, Rochester, NY 14627-0226
†DEC Cambridge Research Lab, One Kendall Sq., Bldg. 700, Cambridge, MA 02139
cashmere@cs.rochester.edu

Abstract

Low-latency remote-write networks, such as DEC's Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a "two-level" software coherent shared memory system, Cashmere-2L, that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel's remote-write capabilities to implement "moderately lazy" release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of asynchrony by eliminating global directory locks and the need for TLB shootdown. Remote interrupts are minimized by exploiting the remote-write capabilities of the Memory Channel network.

Cashmere-2L currently runs on an 8-node, 32-processor DEC AlphaServer system. Speedups range from 8 to 31 on 32 processors for our benchmark suite, depending on the application's characteristics. We quantify the importance of our protocol optimizations by comparing performance to that of several alternative protocols that do not share memory in hardware within an SMP, and require more synchronization. In comparison to a one-level protocol that does not share memory in hardware within an SMP, Cashmere-2L improves performance by up to 46%.

This work was supported in part by NSF grants CDA-9401142, CCR-9319445, CCR-9409120, CCR-9702466, CCR-9705594, and CCR-9510173; ARPA contract F19628-94-C-0057; an external research grant from Digital Equipment Corporation; and a graduate fellowship from Microsoft Research (Galen Hunt).

Permission to make digital/hard copy of part or all this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. SOSP-16 10/97 Saint-Malo, France. © 1997 ACM 0-89791-916-5/97/0010...$3.50

1 Introduction

The shared memory programming model provides ease of use for parallel applications. Unfortunately, while small-scale hardware cache-coherent "symmetric" multiprocessors (SMPs) are now widely available in the market, larger hardware-coherent machines are typically very expensive. Software techniques based on virtual memory have been used to support a shared memory programming model on a network of commodity workstations [3, 6, 12, 14, 17]. In general, however, the high latencies of traditional networks have resulted in poor performance relative to hardware shared memory for applications requiring frequent communication.

Recent technological advances are changing the equation. Low-latency remote-write networks, such as DEC's Memory Channel [11], provide the possibility of transparent and inexpensive shared memory. These networks allow processors in one node to modify the memory of another node safely from user space, with very low (microsecond) latency. Given economies of scale, a "clustered" system of small-scale SMPs on a low-latency network is becoming a highly attractive platform for large, shared-memory parallel programs, particularly in organizations that already own the hardware. SMP nodes reduce the fraction of coherence operations that must be handled in software. A low-latency network reduces the time that the program must wait for those operations to complete.

While software shared memory has been an active area of research for many years, it is only recently that protocols for clustered systems have begun to be developed [7, 10, 13, 22]. The challenge for such a system is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. This challenge is non-trivial: the straightforward "two-level" approach (arrange for each SMP node of a clustered system to play the role of a single processor in a non-clustered system) suffers from a serious problem: it requires the processors within a node to synchronize very frequently, e.g., every time one of them exchanges coherence information with another node.

Our Cashmere-2L system is designed to capitalize on both intra-node cache coherence and low-latency inter-node messages. All processors on a node share the same physical frame for a shared data page. We employ a "moderately lazy" VM-based implementation of release consistency, with multiple concurrent writers, directories, home nodes, and page-size coherence blocks. Updates by multiple writers are propagated to the home node using diffs [6]. Cashmere-2L exploits the capabilities of a low-latency remote-write network to apply these outgoing diffs without remote assistance, and to implement low-cost directories, notification queues, and application locks and barriers.

Cashmere-2L solves the problem of excess synchronization due to protocol operations within a node with a novel technique called two-way diffing: it uses twins (pristine page copies) and diffs (comparisons of pristine and dirty copies) not only to identify local changes that must be propagated to the home node (outgoing diffs), but also to identify remote changes that must be applied to local memory (incoming diffs). The coherence protocol is highly asynchronous: it has no global directory locks, no need for intra-node TLB shootdown or related operations, and only limited need for remote interrupts of any kind: namely, to fetch pages on a miss and to initiate sharing of pages that previously appeared to be private.

We have implemented Cashmere-2L on an 8-node, 32-processor DEC AlphaServer cluster connected by a Memory Channel network. Speedups for our benchmark suite range from 8 to 31 on 32 processors. We have found that exploiting hardware coherence and memory sharing within SMP nodes can improve
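
The two-way diffing idea described above (twins as pristine page copies, outgoing diffs toward the home node, incoming diffs into local memory) can be illustrated with a minimal sketch. This is not the Cashmere-2L implementation: pages are modeled as Python lists of words, and the names `make_twin`, `compute_diff`, `apply_diff`, `flush_outgoing`, and `merge_incoming` are hypothetical. In particular, the sketch assumes that an incoming diff is obtained by comparing a freshly fetched home copy against the twin, and that both kinds of diff are also applied to the twin so it keeps tracking the last state agreed with the home node; the text above does not spell out these details.

```python
from typing import Dict, List

Page = List[int]        # a shared page, modeled as a list of words (assumption)
Diff = Dict[int, int]   # word index -> new value


def make_twin(page: Page) -> Page:
    """Create a pristine copy (twin) of a page before local writes begin."""
    return list(page)


def compute_diff(pristine: Page, current: Page) -> Diff:
    """Record every word whose value in `current` differs from `pristine`."""
    return {i: new for i, (old, new) in enumerate(zip(pristine, current)) if old != new}


def apply_diff(page: Page, diff: Diff) -> None:
    """Write the changed words of a diff into a page, in place."""
    for index, value in diff.items():
        page[index] = value


def flush_outgoing(working: Page, twin: Page, home: Page) -> Diff:
    """Outgoing diff: propagate local changes (working vs. twin) to the home copy."""
    diff = compute_diff(twin, working)
    apply_diff(home, diff)   # stands in for a remote write to the home node
    apply_diff(twin, diff)   # assumed: twin now reflects what the home has seen
    return diff


def merge_incoming(working: Page, twin: Page, fetched: Page) -> Diff:
    """Incoming diff: apply remote changes (fetched vs. twin) to local memory.

    Only words changed remotely are written, so concurrent local writes to
    other words of the working page are left untouched.
    """
    diff = compute_diff(twin, fetched)
    apply_diff(working, diff)
    apply_diff(twin, diff)
    return diff


if __name__ == "__main__":
    home = [0, 0, 0, 0]
    working = list(home)
    twin = make_twin(working)
    working[0] = 7                      # local write on this node
    home[3] = 9                         # another node's update reaches the home copy
    print(merge_incoming(working, twin, list(home)))
    print(flush_outgoing(working, twin, home))
    print(home)
```

Because the incoming diff writes only the words that changed remotely, the local write to word 0 in the example is preserved while the remote write to word 3 is merged in, which is the property that lets a node update its copy without first stopping its other writers.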


UW-Madison CS 739 - Software Coherent Shared Memory on a Clustered Remote-Write Network
