Berkeley COMPSCI 258 - Lazy Release Consistency for Software Distributed Shared Memory

Lazy Release Consistency for Software Distributed Shared Memory

Pete Keleher, Alan L. Cox, and Willy Zwaenepoel
Department of Computer Science
Rice University

(This work is supported in part by NSF Grant No. CDA-8619893 and Texas ATP Grant No. 0036404013. Pete Keleher was supported by a NASA Fellowship.)

Abstract

Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important to reduce the number of messages and the amount of data exchanged for remote memory access. Lazy release consistency is a new algorithm for implementing release consistency that lazily pulls modifications across the interconnect only when necessary. Trace-driven simulation using the SPLASH benchmarks indicates that lazy release consistency reduces both the number of messages and the amount of data transferred between processors. These reductions are especially significant for programs that exhibit false sharing and make extensive use of locks.

1 Introduction

Over the past few years, researchers in hardware distributed shared memory (DSM) have proposed relaxed memory consistency models to reduce the latency associated with remote memory accesses [1, 8, 9, 10, 14]. For instance, in release consistency (RC) [9], writes to shared memory by processor p1 need to be performed (become visible) at another processor p2 only when a subsequent release of p1 performs at p2. This relaxation of the memory consistency model allows the DASH implementation of RC [12] to combat memory latency by pipelining writes to shared memory (see Figure 1). The processor is stalled only when executing a release, at which time it must wait for all its previous writes to perform.

[Figure 1: Pipelining Remote Memory Accesses in DASH.]

In software DSMs, it is also important to reduce the number of messages exchanged. Sending a message in a software DSM is more expensive than in a hardware DSM, because it may involve traps into the operating system kernel, interrupts, context switches, and the execution of several layers of networking software. Ideally, the number of messages exchanged in a software DSM should equal the number of messages exchanged in a message passing implementation of the same application. Therefore, Munin's write-shared protocol [6], a software implementation of RC, buffers writes until a release, instead of pipelining them as in the DASH implementation. At the release, all writes going to the same destination are merged into a single message (see Figure 2).

[Figure 2: Merging of Remote Memory Updates in Munin.]
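To make the buffering and merging at a release concrete, the following Python fragment sketches an eager, write-shared update protocol of this flavor. It is an illustration under simplified assumptions rather than Munin's implementation; the names (Network, EagerReleaseProcessor) and the fixed copyset are introduced here for the example.

```python
# Sketch of an eager, Munin-style write-shared update protocol (illustrative
# names only, not the paper's or Munin's actual code): writes are buffered
# locally and merged into one update message per destination at the release.

class Network:
    """Counts messages so the effect of merging is visible."""
    def __init__(self):
        self.messages_sent = 0

    def send(self, dest, payload):
        self.messages_sent += 1
        print(f"update -> {dest}: {payload}")


class EagerReleaseProcessor:
    def __init__(self, pid, network, copyset):
        self.pid = pid
        self.network = network
        self.copyset = copyset      # processors that cache our shared data
        self.write_buffer = {}      # addr -> value, buffered until the release

    def write(self, addr, value):
        # Under RC the write need not become visible remotely yet: buffer it.
        self.write_buffer[addr] = value

    def release(self, lock):
        # At the release, all buffered writes going to the same destination
        # are merged into a single update message (cf. Figure 2).
        if self.write_buffer:
            for dest in self.copyset:
                if dest != self.pid:
                    self.network.send(dest, dict(self.write_buffer))
        self.write_buffer.clear()


net = Network()
p1 = EagerReleaseProcessor("p1", net, copyset=["p1", "p2", "p3"])
p1.write("x", 1)
p1.write("y", 2)
p1.write("z", 3)
p1.release("l")                               # one merged message per destination
print("messages sent:", net.messages_sent)    # 2 messages for 3 writes
```

Note that the number of release-time messages grows with the size of the copyset. The example of Figure 3, discussed next, shows how this can exceed what a message passing implementation would need.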
Even Munin's write-shared protocol may send more messages than a message passing implementation of the same application. Consider the example of Figure 3, where processors p1 through p3 repeatedly acquire the lock l, write the shared variable x, and then release l. If an update policy is used in conjunction with Munin's write-shared protocol, and x is present in all caches, then all of these cached copies are updated at every release. Logically, however, it suffices to update each processor's copy only when it acquires l. This results in a single message exchange per acquire, as in a message passing implementation. This problem is not peculiar to the use of an update policy. Similar examples can be constructed for an invalidate policy.

[Figure 3: Repeated Updates of Cached Copies in Eager RC.]

Lazy release consistency (LRC) is a new algorithm for implementing RC, aimed at reducing both the number of messages and the amount of data exchanged. Unlike eager algorithms such as Munin's write-shared protocol, lazy algorithms such as LRC do not make modifications globally visible at the time of a release. Instead, LRC guarantees only that a processor that acquires a lock will see all modifications that "precede" the lock acquire. The term "preceding" in this context is to be interpreted in the transitive sense: informally, a modification precedes an acquire if it occurs before any release such that there is a chain of release-acquire operations on the same lock, ending with the current acquire (see Section 4 for a precise definition). For instance, in Figure 3, all modifications that occur in program order before any of the releases in p1 through p3 precede the lock acquisition in p4. With LRC, modifications are propagated at the time of an acquire. Only the modifications that "precede" the acquire are sent to the acquiring processor. The modifications can be piggybacked on the message that grants the lock, further reducing message traffic. Figure 4 shows the message traffic in LRC for the same shared data accesses as in Figure 3: l and x are sent in a single message at each acquire.

[Figure 4: Message Traffic in LRC.]

By not propagating modifications globally at the time of the release, and by piggybacking data movement on lock transfer messages, LRC reduces both the number of messages and the amount of data exchanged. We present the results of a simulation study, using the SPLASH benchmarks, that confirms this intuition. LRC is, however, more complex to implement than eager RC because it must keep track of the "precedes" relation. We intend to implement LRC to evaluate its runtime cost. The message and data reductions seen in our simulations seem to indicate that LRC will outperform eager RC in a software DSM environment.

The outline of the rest of this paper is as follows. In Section 2, we state the definition of RC. In Section 3, we present an eager implementation of RC based on Munin's write-shared protocol. In Section 4, we define LRC and outline its implementation. In Section 5, we describe a comparison through simulation of eager RC and LRC. We briefly discuss related work in Section 6, and we draw conclusions and explore avenues for further work in Section 7.
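Before turning to the formal definitions, a counterpart to the eager sketch above may help fix intuition for the lazy scheme just described. The fragment below is restricted, for simplicity, to a single lock whose grant message carries the accumulated modifications; the names (LockToken, LazyProcessor) and this token-carries-the-data simplification are assumptions of the sketch, not the paper's algorithm, which tracks the transitive "precedes" relation across all locks (Section 4).

```python
# Sketch of lazy release consistency for a single lock (illustrative names,
# not the paper's implementation): nothing is sent at release time; instead
# the lock-grant message piggybacks every modification accumulated along the
# chain of release-acquire operations on that lock. This approximates the
# transitive "precedes" relation for the single-lock case only.

class LockToken:
    """The lock grant; it carries pending modifications with it."""
    def __init__(self):
        self.pending = {}    # addr -> value, accumulated lazily
        self.grants = 0      # grant messages double as data messages


class LazyProcessor:
    def __init__(self, pid):
        self.pid = pid
        self.memory = {}         # this processor's view of shared data
        self.write_buffer = {}   # local writes since the last release

    def acquire(self, token):
        # One message: the grant, with preceding modifications piggybacked.
        token.grants += 1
        self.memory.update(token.pending)

    def write(self, addr, value):
        self.memory[addr] = value
        self.write_buffer[addr] = value

    def release(self, token):
        # No updates or invalidates are sent here; the buffered writes simply
        # travel with the token to whichever processor acquires the lock next.
        token.pending.update(self.write_buffer)
        self.write_buffer.clear()


token = LockToken()
p1, p2, p3, p4 = (LazyProcessor(f"p{i}") for i in range(1, 5))

for p in (p1, p2, p3):        # p1..p3 each acquire l, write x, release l
    p.acquire(token)
    p.write("x", p.pid)
    p.release(token)

p4.acquire(token)             # p4 sees all writes that precede its acquire
print("grant messages:", token.grants)   # 4: one per acquire, as in Figure 4
print("p4 sees x =", p4.memory["x"])     # "p3", the last value released
```

In this single-lock scenario each acquire costs exactly one message, carrying both the grant and the data, which matches the single message exchange per acquire that a message passing implementation would use.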


2 Release Consistency

Release consistency (RC) [9] is a form of relaxed