Berkeley COMPSCI 258 - Efficient Synchronization: Let Them Eat QOLB

Efficient Synchronization: Let Them Eat QOLB¹

Alain Kägi, Doug Burger, and James R. Goodman
Computer Sciences Department
University of Wisconsin-Madison
1210 West Dayton Street
Madison, Wisconsin 53706 USA
[email protected] - http://www.cs.wisc.edu/~galileo

Abstract

Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base types of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.

1 Introduction

Shared-memory multiprocessors are rapidly becoming the machines of choice for solving large, fine-grained scientific programs. Multiple factors support this trend. The advent of affordable desktop symmetric multiprocessors (SMPs) will increase the application base.
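To make the first of the four mechanisms concrete, the following C11 sketch contrasts a plain TEST&SET lock with TEST&TEST&SET, whose waiters spin locally on a cached copy of the lock word. The names and structure are illustrative, not the paper's simulated primitives:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simple TEST&SET spin lock: every spin iteration issues an atomic
   read-modify-write, generating coherence traffic on the lock's line. */
typedef struct { atomic_bool held; } ts_lock;

void ts_acquire(ts_lock *l) {
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
        ; /* each iteration is a remote read-modify-write */
}

/* TEST&TEST&SET adds local spinning: waiters read a cached copy of the
   lock word and attempt the read-modify-write only when it appears free. */
void tts_acquire(ts_lock *l) {
    for (;;) {
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ; /* local spinning: hits in the waiter's own cache */
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            return; /* won the race for the free lock */
    }
}

void ts_release(ts_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Under contention, TEST&SET waiters repeatedly invalidate one another's copies of the lock line, while TEST&TEST&SET waiters generate traffic only at release time, when the cached copy changes.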
The successful development of shared-memory multiprocessing standards [43] reduces time to market by decreasing design time and by letting manufacturers use commodity parts. Both the Convex Exemplar [7] and the Sequent STiNG [27] relied on these standards. The emergence of low-cost, fine-grain software implementations of shared memory, such as SHASTA [38] or T0 [35], further reduces the cost of supporting the shared-memory model. Finally, successful research prototypes such as the Stanford DASH [25] have shown that this class of machines can obtain excellent speedups for a wide range of programs that use fine-grained communication.

Traditional message-passing programming models force the programmer to embed implicit synchronization with each communication of data. Such a requirement restricts the parallelization strategy; dynamic task distribution, for example, becomes extremely difficult. The shared-memory programming model, conversely, uses cache coherence protocols to keep shared data consistent. The programmer judiciously employs explicit synchronization to provide mutual exclusion for data and code, as well as to synchronize processors between phases of computation. The two major classes of explicit synchronization operations in shared-memory multiprocessors are barriers and locks.

---
¹ Pronounced "Colby." This work is supported in part by NSF Grant CCR-9509589 and by the Intel Research Council.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA '97 Denver, CO, USA
© 1997 ACM 0-89791-901-7/97/0006...$3.50
---
Although barriers are important to efficient shared-memory programs, they are well understood, and many efficient implementations have been proposed and/or built [15, 20, 23, 32, 44]. In this study, we focus on providing more efficient mutual exclusion through better locks.

Locks provide individual processors with exclusive access to shared data and a critical section of code. This exclusive access is particularly well-suited to the fine-grained nature of many shared-memory parallel programs. Fine-grained programs ideally associate as little data or code as possible with a critical section, minimizing serialized processing and thus maximizing available parallelism. Since access to critical sections is by definition serialized among processors, large overheads when accessing a contested critical section degrade both parallel performance and potential scalability. To maximize both the performance of fine-grain parallel applications that use locking and the potential to scale to larger numbers of processors, we must minimize the delays associated with the transfer of exclusively accessed resources.

The act of transferring control of a critical section is a complex one that may involve multiple remote transactions. Complex protocols have been proposed that perform this transfer efficiently, allowing reasonable performance when there is high contention for a lock. The complexity of these protocols causes unnecessary delays when accessing a lock that is not held. Conversely, simple locking schemes that can access a free lock quickly may perform poorly in the presence of contention. This fundamental trade-off has resulted in proposals of numerous primitives in the literature [3, 13, 16, 26, 28, 30, 37]. This paper contains a detailed and thorough evaluation of a range of locking primitives.
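Queue-based locking, one of the four mechanisms and the basis of the MCS lock evaluated here, illustrates the contended side of this trade-off: waiters form an explicit queue and each spins on its own node, so a release hands the lock directly to one successor. The following is a simplified single-node-per-waiter sketch in C11, not the paper's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queue lock (sketch). Each waiter supplies a node; the lock
   holds only a pointer to the tail of the waiter queue. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *pred = atomic_exchange(&l->tail, me); /* join the queue */
    if (pred) {
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ; /* local spinning on our own node, not the lock word */
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return; /* no waiter: the lock becomes free */
        while (!(succ = atomic_load(&me->next)))
            ; /* a waiter swapped the tail but has not linked in yet */
    }
    atomic_store(&succ->locked, false); /* hand off to exactly one waiter */
}
```

The extra atomic exchange and node bookkeeping are exactly the "complexity cost" on an uncontended lock that the text describes: a free TEST&SET lock is acquired in one operation, while the queue lock always pays for queue maintenance.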
To understand where the opportunities for optimization lie, we first decompose the time associated with a complete locking period into three phases: Transfer, Load/Compute, and Release. Together, these phases form a synchronization period, which determines the global throughput of synchronization operations and thus determines scalability for codes that rely
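The decomposition above can be sketched as simple arithmetic: because critical-section execution is serialized, the aggregate rate at which processors can pass through a contended lock is bounded by the reciprocal of the synchronization period, regardless of processor count. The struct and function names below are illustrative, not from the paper:

```c
/* Model of one synchronization period as the sum of its three phases
   (times in arbitrary units, e.g. cycles). */
typedef struct { double transfer, load_compute, release; } sync_phases;

double sync_period(sync_phases p) {
    return p.transfer + p.load_compute + p.release;
}

/* Serialized hand-offs bound contended-lock throughput at 1 / period,
   independent of how many processors are waiting. */
double max_lock_throughput(sync_phases p) {
    return 1.0 / sync_period(p);
}
```

Shrinking any one phase (e.g. Transfer, via direct hand-off, or Load/Compute, via collocation of data with the lock) shortens the period and raises this bound.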