Berkeley COMPSCI 258 - Efficient Synchronization: Let Them Eat QOLB

Efficient Synchronization: Let Them Eat QOLB¹

Alain Kägi, Doug Burger, and James R. Goodman
Computer Sciences Department
University of Wisconsin-Madison
1210 West Dayton Street
Madison, Wisconsin 53706 USA
[email protected] - http://www.cs.wisc.edu/~galileo

Abstract

Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base types of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.

1 Introduction

Shared-memory multiprocessors are rapidly becoming the machines of choice for solving large, fine-grained scientific programs. Multiple factors support this trend. The advent of affordable desktop symmetric multiprocessors (SMPs) will increase the application base.
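To make the first of the four mechanisms concrete, the following C11 sketch contrasts a plain TEST&SET lock with TEST&TEST&SET, whose waiters spin locally on a cached copy of the lock word. The names and structure are illustrative, not the paper's simulated primitives:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simple TEST&SET spin lock: every spin iteration issues an atomic
   read-modify-write, generating coherence traffic on the lock's line. */
typedef struct { atomic_bool held; } ts_lock;

void ts_acquire(ts_lock *l) {
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
        ; /* each iteration is a remote read-modify-write */
}

/* TEST&TEST&SET adds local spinning: waiters read a cached copy of the
   lock word and attempt the read-modify-write only when it appears free. */
void tts_acquire(ts_lock *l) {
    for (;;) {
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ; /* local spinning: hits in the waiter's own cache */
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            return; /* won the race for the free lock */
    }
}

void ts_release(ts_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Under contention, TEST&SET waiters repeatedly invalidate one another's copies of the lock line, while TEST&TEST&SET waiters generate traffic only at release time, when the cached copy changes.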
The successful development of shared-memory multiprocessing standards [43] reduces time to market by decreasing design time and by letting manufacturers use commodity parts. Both the Convex Exemplar [7] and the Sequent STiNG [27] relied on these standards. The emergence of low-cost, fine-grain software implementations of shared memory, such as SHASTA [38] or T0 [35], further reduces the cost of supporting the shared-memory model. Finally, successful research prototypes such as the Stanford DASH [25] have shown that this class of machines can obtain excellent speedups for a wide range of programs that use fine-grained communication.

Traditional message-passing programming models force the programmer to embed implicit synchronization with each communication of data. Such a requirement restricts the parallelization strategy; dynamic task distribution, for example, becomes extremely difficult. The shared-memory programming model, conversely, uses cache coherence protocols to keep shared data consistent. The programmer judiciously employs explicit synchronization to provide mutual exclusion for data and code, as well as to synchronize processors between phases of computation. The two major classes of explicit synchronization operations in shared-memory multiprocessors are barriers and locks.

---
¹ Pronounced "Colby." This work is supported in part by NSF Grant CCR-9509589 and by the Intel Research Council.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA '97 Denver, CO, USA
© 1997 ACM 0-89791-901-7/97/0006...$3.50
---
Although barriers are important to efficient shared-memory programs, they are well understood, and many efficient implementations have been proposed and/or built [15, 20, 23, 32, 44]. In this study, we focus on providing more efficient mutual exclusion through better locks.

Locks provide individual processors with exclusive access to shared data and a critical section of code. This exclusive access is particularly well-suited to the fine-grained nature of many shared-memory parallel programs. Fine-grained programs ideally associate as little data or code as possible with a critical section, minimizing serialized processing and thus maximizing available parallelism. Since access to critical sections is by definition serialized among processors, large overheads when accessing a contested critical section degrade both parallel performance and potential scalability. To maximize both the performance of fine-grain parallel applications that use locking and the potential to scale to larger numbers of processors, we must minimize the delays associated with the transfer of exclusively accessed resources.

The act of transferring control of a critical section is a complex one that may involve multiple remote transactions. Complex protocols have been proposed that perform this transfer efficiently, allowing reasonable performance when there is high contention for a lock. The complexity of these protocols causes unnecessary delays when accessing a lock that is not held. Conversely, simple locking schemes that can access a free lock quickly may perform poorly in the presence of contention. This fundamental trade-off has resulted in proposals of numerous primitives in the literature [3, 13, 16, 26, 28, 30, 37]. This paper contains a detailed and thorough evaluation of a range of locking primitives.
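Queue-based locking, one of the four mechanisms and the basis of the MCS lock evaluated here, illustrates the contended side of this trade-off: waiters form an explicit queue and each spins on its own node, so a release hands the lock directly to one successor. The following is a simplified single-node-per-waiter sketch in C11, not the paper's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queue lock (sketch). Each waiter supplies a node; the lock
   holds only a pointer to the tail of the waiter queue. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *pred = atomic_exchange(&l->tail, me); /* join the queue */
    if (pred) {
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ; /* local spinning on our own node, not the lock word */
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return; /* no waiter: the lock becomes free */
        while (!(succ = atomic_load(&me->next)))
            ; /* a waiter swapped the tail but has not linked in yet */
    }
    atomic_store(&succ->locked, false); /* hand off to exactly one waiter */
}
```

The extra atomic exchange and node bookkeeping are exactly the "complexity cost" on an uncontended lock that the text describes: a free TEST&SET lock is acquired in one operation, while the queue lock always pays for queue maintenance.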
To understand where the opportunities for optimization lie, we first decompose the time associated with a complete locking period into three phases: Transfer, Load/Compute, and Release. Together, these phases form a synchronization period, which determines the global throughput of synchronization operations and thus determines scalability for codes that rely
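The decomposition above can be sketched as simple arithmetic: because critical-section execution is serialized, the aggregate rate at which processors can pass through a contended lock is bounded by the reciprocal of the synchronization period, regardless of processor count. The struct and function names below are illustrative, not from the paper:

```c
/* Model of one synchronization period as the sum of its three phases
   (times in arbitrary units, e.g. cycles). */
typedef struct { double transfer, load_compute, release; } sync_phases;

double sync_period(sync_phases p) {
    return p.transfer + p.load_compute + p.release;
}

/* Serialized hand-offs bound contended-lock throughput at 1 / period,
   independent of how many processors are waiting. */
double max_lock_throughput(sync_phases p) {
    return 1.0 / sync_period(p);
}
```

Shrinking any one phase (e.g. Transfer, via direct hand-off, or Load/Compute, via collocation of data with the lock) shortens the period and raises this bound.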