CMU CS 15745 - Optimizing Memory Transactions

Optimizing Memory Transactions

Tim Harris†   Mark Plesko†   Avraham Shinnar‡   David Tarditi†
†Microsoft Research   ‡Harvard University
[email protected]   [email protected]   [email protected]   [email protected]

Abstract

Atomic blocks allow programmers to delimit sections of code as 'atomic', leaving the language's implementation to enforce atomicity. Existing work has shown how to implement atomic blocks over word-based transactional memory that provides scalable multiprocessor performance without requiring changes to the basic structure of objects in the heap. However, these implementations perform poorly because they interpose on all accesses to shared memory in the atomic block, redirecting updates to a thread-private log which must be searched by reads in the block and later reconciled with the heap when leaving the block.

This paper takes a four-pronged approach to improving performance: (1) we introduce a new 'direct access' implementation that avoids searching thread-private logs, (2) we develop compiler optimizations to reduce the amount of logging (e.g. when a thread accesses the same data repeatedly in an atomic block), (3) we use runtime filtering to detect duplicate log entries that are missed statically, and (4) we present a series of GC-time techniques to compact the logs generated by long-running atomic blocks.

Our implementation supports short-running scalable concurrent benchmarks with less than 50% overhead over a non-thread-safe baseline. We support long atomic blocks containing millions of shared memory accesses with a 2.5-4.5x slowdown.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures
General Terms: Algorithms, Languages, Performance
Keywords: Atomicity, Critical Regions, Transactional Memory

1. Introduction

Atomic blocks provide a promising simplification to the problem of writing concurrent programs [12].
A code block is marked atomic and the compiler and runtime system ensure that operations within the block, including function calls, appear atomic. The programmer no longer needs to worry about manual locking, low-level race conditions, or deadlocks. Atomic blocks can also provide exception recovery, whereby a block's side effects are rolled back if an exception terminates it [13]. This is valuable even in a single-threaded application: error handling code is often difficult to write and to test [29]. Implementations of atomic blocks scale to large multiprocessor machines [12] because they are parallelism preserving: atomic blocks can execute concurrently so long as a location being updated in one block is not being accessed in any of the others. This preserves the kind of sharing allowed in a conventional data cache.

Although they scale well, current implementations of atomic blocks introduce substantial runtime overhead [12]. They are built using word-based software transactional memory (STM) which allows a series of memory accesses made via the STM library to be performed atomically. There are three main reasons for the runtime overhead, which we discuss in more detail in Section 2:

• STM implementations typically create private shadow copies of memory updated in atomic blocks. This introduces lookups on all read operations in atomic blocks and slows down write operations when there is no contention.
[Footnote: Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'06, June 11–14, 2006, Ottawa, Ontario, Canada. Copyright © 2006 ACM 1-59593-320-4/06/0006...$5.00.]
Furthermore, the cost of these lookups precludes the use of atomic blocks across longer-running sections of code.

• STM is implemented as a library. Calls to STM operations are introduced late in compilation and are treated as opaque calls. This misses many optimization opportunities.

• STM operations are used unnecessarily. Accesses to heap data are blindly redirected through the STM without consideration of whether or not an object is thread-local.

We address these problems with a novel STM implementation that is more tightly integrated with the compiler and runtime system. We make a number of contributions:

• Direct-access STM. Our STM is the first to allow objects to be updated directly in the heap rather than working on private shadow copies of objects, or via extra levels of indirection between an object reference and the current object contents. This optimizes for transactions that commit successfully.

• Decomposed STM interface. Section 3 describes how we decompose transactional memory operations to expose opportunities for classical optimizations. For instance, a transactional store obj.field = x is split into steps that (a) record that obj is being updated by the current thread, (b) log the old value that field held, and (c) store the new value x into the field. These three steps are then handled separately by the compiler, and (a) and (b) can often be hoisted from a loop while (c) cannot.

• Compile-time optimizations. Section 4 describes additional optimizations to reduce the number of calls to the STM interface. For instance, by further decomposing the logging operations we can amortize the cost of checking for space across a series of stores into the log.

• Integrated transactional versioning. Our STM is the first to integrate transactional versioning with an existing object header word.
Earlier STMs, even those integrated in a managed runtime environment, either used external tables of versioning records [12], additional header words [13], or made programmer-visible changes to the object model to add levels of indirection between object references and current object contents [16, 10, 23].

• Runtime filtering. Not all unnecessary operations can be identified statically, so we add complementary runtime filtering, e.g. to remove updates to transaction-local objects (Section 5).

• Garbage collection (GC) integration. Our implementation is the first to allow the GC to reclaim objects that become unreachable within a still-running transaction; earlier work would hold onto such objects until the transaction that allocated them has committed or aborted (Section 6).
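The exception-recovery semantics described in the introduction (a block's side effects are rolled back if an exception terminates it) can be sketched with a small undo log. This is a minimal illustration of the semantics only, not the paper's implementation; the `AtomicBlock` class and `transfer` example are invented for the sketch.

```python
# Minimal sketch (not the paper's implementation): an undo log that
# rolls back side effects when an exception escapes an "atomic" block.

class AtomicBlock:
    """Context manager emulating an atomic block over plain objects."""

    def __init__(self):
        self.undo_log = []  # (obj, field, old_value) entries

    def write(self, obj, field, value):
        # Log the old value before updating so it can be restored.
        self.undo_log.append((obj, field, getattr(obj, field)))
        setattr(obj, field, value)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Roll back in reverse order, restoring each old value.
            for obj, field, old in reversed(self.undo_log):
                setattr(obj, field, old)
        return False  # re-raise the exception after rollback


class Account:
    def __init__(self, balance):
        self.balance = balance


def transfer(src, dst, amount):
    with AtomicBlock() as tx:
        tx.write(src, "balance", src.balance - amount)
        if src.balance < 0:
            raise ValueError("insufficient funds")  # triggers rollback
        tx.write(dst, "balance", dst.balance + amount)
```

A successful transfer leaves both updates in place; one that raises restores the source balance, so the partial debit is never observed after the block.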
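The decomposed store of the contributions list, steps (a)-(c), and the loop-hoisting it enables can be sketched as follows. The names `open_for_update`, `log_old_value`, and `raw_store` are illustrative stand-ins, not the paper's actual interface.

```python
# Illustrative decomposition of a transactional store obj.field = x:
#   (a) open_for_update: record that obj is being updated by this thread
#   (b) log_old_value:   log the value the field held
#   (c) raw_store:       store the new value directly into the field
# All names here are invented for the sketch.

class TxLog:
    def __init__(self):
        self.opened = []  # objects opened for update (step a)
        self.undo = []    # (obj, field, old_value) records (step b)

    def open_for_update(self, obj):
        self.opened.append(obj)

    def log_old_value(self, obj, field):
        self.undo.append((obj, field, getattr(obj, field)))

    def raw_store(self, obj, field, value):
        setattr(obj, field, value)  # direct heap update, no shadow copy


class Node:
    def __init__(self, count=0):
        self.count = count


def increment_n_times(tx, node, n):
    # Steps (a) and (b) are loop-invariant for a fixed obj.field, so a
    # compiler can hoist them out of the loop; only (c) stays inside.
    tx.open_for_update(node)         # hoisted: one entry, not n
    tx.log_old_value(node, "count")  # hoisted: one undo record, not n
    for _ in range(n):
        tx.raw_store(node, "count", node.count + 1)
```

Without hoisting, the loop would produce n open-for-update entries and n undo records; after hoisting it produces exactly one of each, which is the logging reduction the paper's compile-time optimizations target.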
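The runtime-filtering contribution, catching duplicate log entries the compiler misses, can be illustrated with a per-transaction hash set keyed on (object, field). This is one simple way to filter duplicates; the paper's actual filter design is not shown in this preview and may differ.

```python
# Sketch of runtime filtering: skip duplicate undo-log entries that
# static analysis could not prove redundant. A hash set of
# (object id, field) keys stands in for the real filter.

class Pair:
    def __init__(self):
        self.x = 0
        self.y = 0


class FilteredTxLog:
    def __init__(self):
        self.undo = []
        self._seen = set()  # (id(obj), field) pairs already logged

    def log_old_value(self, obj, field):
        key = (id(obj), field)
        if key in self._seen:
            return          # duplicate entry: filtered out at runtime
        self._seen.add(key)
        self.undo.append((obj, field, getattr(obj, field)))
```

Logging the same field of the same object twice in one transaction yields a single undo record, which keeps the log compact in blocks that repeatedly touch the same data.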

