Is SC + ILP = RC?

Chris Gniady, Babak Falsafi, and T. N. Vijaykumar
School of Electrical & Computer Engineering
Purdue University
1285 EE Building
West Lafayette, IN
[email protected], http://www.ece.purdue.edu/~impetus

Abstract

Sequential consistency (SC) is the simplest programming interface for shared-memory systems but imposes program order among all memory operations, possibly precluding high-performance implementations. Release consistency (RC), however, enables the highest-performance implementations but puts the burden on the programmer to specify which memory operations need to be atomic and in program order. This paper shows, for the first time, that SC implementations can perform as well as RC implementations if the hardware provides enough support for speculation. Both SC and RC implementations rely on reordering and overlapping memory operations for high performance. To enforce order when necessary, an RC implementation uses software guarantees, whereas an SC implementation relies on hardware speculation. Our SC implementation, called SC++, closes the performance gap because: (1) the hardware allows not just loads, as some current SC implementations do, but also stores to bypass each other speculatively to hide remote latencies, (2) the hardware provides large speculative state for not just the processor, as previously proposed, but also memory, to allow out-of-order memory operations, (3) the support for hardware speculation does not add excessive overhead to the processor pipeline's critical paths, and (4) well-behaved applications incur infrequent rollbacks of speculative execution. Using simulation, we show that SC++ achieves an RC implementation's performance in all six applications we studied.

1 Introduction

Multiprocessors are becoming widely available in all sectors of the computing market, from desktops to high-end servers. To simplify programming multiprocessors, many vendors implement shared memory as the primary system-level programming abstraction. To achieve high performance, the shared-memory abstraction is typically implemented in hardware. Shared-memory systems come with a variety of programming interfaces, also known as memory consistency models, offering a trade-off between programming simplicity and high performance.

Sequential consistency (SC) is the simplest and most intuitive programming interface [9]. An SC-compliant memory system appears to execute memory operations one at a time and in program order. SC's simple memory behavior is what programmers often expect from a shared-memory multiprocessor because of its similarity to the familiar uniprocessor memory system. Traditionally, SC is believed to preclude high performance because conventional SC implementations would conservatively impose order among all memory operations to satisfy the requirements of the model. Such implementations would be prohibitively slow, especially in distributed shared memory (DSM), where remote memory accesses can take several times longer than local memory accesses.

To mitigate the performance impact of long-latency operations in shared memory and to realize the raw performance of the hardware, researchers and system designers have invented several relaxed memory models [3,2,6]. Relaxed memory models significantly improve performance over conventional SC implementations by requiring only some memory operations to perform in program order. By otherwise overlapping some or all other memory operations, relaxed models hide much of the memory operations' long latencies. Relaxed models, however, complicate the programming interface by burdening programmers with the details of annotating memory operations to specify which operations must execute in program order.
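To make the programmer-visible difference concrete, the following sketch uses C++11 atomics as a stand-in; it is introduced here for illustration and does not appear in the paper, which predates C++11. The release and acquire annotations play the role a relaxed model assigns to the programmer: they mark the one pair of operations that must be ordered, while the remaining accesses are left free to be reordered and overlapped. The producer/consumer scenario and the names (data, ready) are assumptions made for this sketch.

// Illustrative sketch only; not taken from the paper.
// With a relaxed, RC-like interface the programmer annotates exactly the
// operations that need ordering (the release store and the acquire load);
// the data access itself carries no ordering requirement of its own.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_relaxed);      // free to be overlapped
    ready.store(true, std::memory_order_release);   // annotation: ordering point
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // annotation: ordering point
        ;                                            // spin until published
    assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed to see 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}

Writing both functions with the default std::memory_order_seq_cst everywhere would correspond to the SC-style interface described above: simpler to reason about, since every atomic operation appears to execute in program order, but it constrains how much the hardware may reorder and overlap, which is the cost a conventional SC implementation pays.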
Modern microprocessors employ aggressive instruction execution mechanisms to extract greater instruction-level parallelism (ILP) and reduce program execution time. To maximize ILP, these mechanisms allow instructions to execute both speculatively and out of program order. The ILP mechanisms buffer the speculative state of such instructions to maintain sequential semantics upon a misspeculation or an exception. These mechanisms have reopened the debate about memory models because they enable SC implementations to speculatively relax the memory order and yet appear to execute memory operations atomically and in program order [5,14,7].

An aggressive SC implementation can speculatively perform all memory operations in a processor cache. Such an implementation rolls back to the "sequentially consistent" memory state if another processor is about to observe that the model's constraints are violated (e.g., a store by one processor to a memory block loaded speculatively out of order by another). In the absence of frequent rollbacks, an SC implementation can potentially perform as well as the best of the relaxed models, Release Consistency (RC), because it emulates an RC implementation's behavior in every other aspect.
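The rollback condition just described can be sketched in a few lines. The following is a simplified software model introduced here for illustration, not the SC++ hardware the paper proposes; the class, its methods, and the use of a hash set are assumptions. It captures only the detection step: a load performed speculatively out of program order is tracked until it retires, and an incoming invalidation from another processor that hits a tracked block forces a rollback to the sequentially consistent state.

// Simplified, illustrative model of the rollback check; NOT the SC++
// hardware from the paper. Names and data structures are invented here.
#include <cstdint>
#include <unordered_set>

class SpeculativeLoadTracker {
public:
    // A load issued out of program order records its block address.
    void onSpeculativeLoad(uint64_t blockAddr) {
        pending_.insert(blockAddr);
    }

    // When the load retires in program order, its speculation is resolved.
    void onRetire(uint64_t blockAddr) {
        pending_.erase(blockAddr);
    }

    // An invalidation from another processor targets blockAddr.
    // Returns true if the core must roll back to the last
    // sequentially consistent state and re-execute from there.
    bool onExternalInvalidation(uint64_t blockAddr) const {
        return pending_.count(blockAddr) != 0;
    }

private:
    std::unordered_set<uint64_t> pending_;  // speculatively loaded blocks
};

A real design must also bound this speculative state and handle speculative stores; points (1) and (2) of the abstract describe SC++ extending that support to stores and to a larger speculative state for both processor and memory.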


Gharachorloo et al. [5] first made the observation that exploiting ILP mechanisms allows optimizing SC's performance. Their proposed techniques are implemented in the HP PA-8000, Intel Pentium Pro, and MIPS R10000. Ranganathan et al. re-evaluated these techniques [13] and proposed further optimizations [14], but concluded that a significant gap between SC and RC implementations remains for some applications, and identified some of the factors contributing to the difference. Hill [7], however, argues that with current trends towards larger levels of on-chip integration, sophisticated microarchitectural innovation, and larger caches, the performance gap between the memory models should eventually vanish.

This paper confirms Hill's conjecture by showing, for the first time, that an SC implementation can perform as well as an RC implementation if the hardware provides enough support for speculation. The key observation is that both SC and RC implementations rely on reordering and overlapping memory operations to achieve high performance. While RC implementations primarily use software guarantees to enforce program order only when necessary, SC implementations rely on hardware speculation to provide the guarantee. So long as hardware speculation enables SC implementations to relax all memory orders speculatively and "emulate" RC implementations, SC implementations can reach RC implementations' performance. Any shortcoming in the hardware support for speculation prevents SC implementations from
