Synchronization
Todd C. Mowry
CS 740 F'98
November 24, 1998

Topics
• Locks
• Barriers

Types of Synchronization
Mutual Exclusion
• Locks
Event Synchronization
• Global or group-based (barriers)
• Point-to-point
– tightly coupled with data (full-empty bits, futures)
– separate from data (flags)

Blocking vs. Busy-Waiting
Busy-waiting is preferable when:
• scheduling overhead is larger than the expected wait time
• processor resources are not needed for other tasks
• scheduler-based blocking is inappropriate (e.g., in the OS kernel)
But typical busy-waiting produces lots of network traffic
• hot-spots

Reducing Busy-Waiting Traffic
The trend was toward increasing hardware support:
• sophisticated atomic operations
• combining networks
• multistage networks with special sync variables in the switches
• special-purpose cache hardware to maintain queues of waiters
But (Mellor-Crummey and Scott): appropriate software synchronization algorithms can eliminate most busy-waiting traffic.

Software Synchronization
Hardware support required:
• simple fetch-and-op support
• ability to "cache" shared variables
Implications:
• efficient software synchronization can be designed for large machines
• special-purpose hardware has only small additional benefits
– a small constant factor for locks
– at best a factor of log P for barriers

Mutual Exclusion
Discuss four sets of primitives:
• test-and-set lock
• ticket lock
• array-based queueing lock
• list-based queueing lock

Test and Set Lock
Cacheable vs.
non-cacheable
Cacheable:
• good if there is locality on lock access
– reduces both latency and lock duration
– particularly if the lock is acquired in exclusive state
• not good for high-contention locks
Non-cacheable:
• read-modify-write at main memory
• easier to implement
• high latency
• better for high-contention locks

Test and Test and Set

    A:  while (lock != FREE)
            ;                         /* spin locally in the cache */
        if (test&set(lock) == FREE) {
            /* critical section */
        } else
            goto A;

(+) spinning happens in the cache
(-) can still generate a lot of traffic when many processors go to do the test&set

Test and Set with Backoff
Upon failure, delay for a while before retrying
• either constant delay or exponential backoff
Tradeoffs:
(+) much less network traffic
(-) exponential backoff can cause starvation for high-contention locks
– new requestors back off for shorter times
But exponential backoff is found to work best in practice

Test and Set with Update
Test and Set sends updates to the processors that cache the lock
Tradeoffs:
(+) good for bus-based machines
(-) still lots of traffic on distributed networks
The main problem with test&set-based schemes is that a lock release causes all waiters to try to get the lock, each using a test&set.

Ticket Lock (fetch&incr based)
The ticket lock has just one proc acquire the lock when it is released.
Two counters:
• next_ticket (number of requestors)
• now_serving (number of releases that have happened)
Algorithm:
• first do a fetch&incr on next_ticket (not a test&set)
• when a release happens, poll the value of now_serving
– if it equals my_ticket, then I win
Use delay, but how much?
• exponential is bad
• tough to find a delay that makes network traffic constant for a single lock

Ticket Lock Tradeoffs
(+) guaranteed FIFO order; no starvation possible
(+) latency can be low if fetch&incr is cacheable
(+) traffic can be quite low
(-) but traffic is not guaranteed to be O(1) per lock acquire

Array-Based Queueing Locks
Every process spins on a unique location, rather than on a single
now_serving counter
fetch&incr gives a process the address on which to spin
Tradeoffs:
(+) guarantees FIFO order (like the ticket lock)
(+) O(1) traffic with coherent caches (unlike the ticket lock)
(-) requires space per lock proportional to P

List-Based Queueing Locks (MCS)
All the other good things + O(1) traffic even without coherent caches (spin locally)
Uses compare&swap to build linked lists in software
A locally-allocated flag per list node to spin on
Can work with fetch&store, but then loses the FIFO guarantee
Tradeoffs:
(+) less storage than array-based locks
(+) O(1) traffic even without coherent caches
(-) compare&swap is not easy to implement

Lock Performance
(See attached sheets.)

Implementing Fetch&Op Primitives
One possibility for implementation in caches: LL/SC
• "Load Linked" (LL) loads the lock and sets a bit
• when the "atomic" operation is finished, "Store Conditional" (SC) succeeds only if the bit was not reset in the interim
• fits within the "load/store" (aka RISC) architecture paradigm
– i.e.,
no instruction performs more than one memory operation
– good for pipelining and clock rates
Good for bus-based machines: the SC result is known on the bus
More complex for directory-based machines:
• wait for the SC to go to the directory and get ownership (long latency)
• have LL load in exclusive mode, so the SC succeeds immediately if the line is still in exclusive mode

Bottom Line for Locks
Lots of options
Using simple hardware primitives, we can build very successful algorithms in software
• LL/SC works well if there is locality of synch accesses
• otherwise, in-memory fetch&ops are good for high contention

Lock Recommendations
Criteria:
• scalability
• one-processor latency
• space requirements
• fairness
• atomic operation requirements

Scalability
MCS locks are the most scalable in general
• array-based queueing locks are also good with coherent caches
Test&set and ticket locks scale well with backoff
• but add network traffic

Single-Processor Latency
Test&set and ticket locks are best
MCS and array-based queueing locks are also good when good fetch&op primitives are available
• array-based locks are bad in terms of space if too many locks are needed
– in the OS kernel, for example
– when there are more processes than processors

Fairness
The ticket lock, array-based lock, and MCS all guarantee FIFO
• MCS needs compare&swap
Fairness can waste CPU resources in case of preemption
• can be fixed by coscheduling

Primitives Required
Test&Set:
• test&set
Ticket:
• fetch&incr
MCS:
• fetch&store
• compare&swap
Queue-based:
• fetch&store

Lock Recommendations
For scalability and minimizing network traffic:
• MCS
– one-proc latency is good if an efficient fetch&op is provided
If one-proc
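The ticket lock algorithm described in the slides (fetch&incr on next_ticket, then poll now_serving) can be sketched with C11 atomics. This is only an illustration of the idea, not code from the lecture; the names ticket_lock_t, ticket_acquire, and ticket_release are invented here, and the proportional polling delay the slides recommend is omitted for brevity.

```c
#include <stdatomic.h>

/* Illustrative ticket lock sketch (names are not from the lecture). */
typedef struct {
    atomic_uint next_ticket;   /* incremented once per requestor */
    atomic_uint now_serving;   /* incremented once per release   */
} ticket_lock_t;

void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_acquire(ticket_lock_t *l) {
    /* fetch&incr, not test&set: each requestor draws a unique ticket */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* poll now_serving until our ticket comes up; a real version
       would insert a delay between polls (the slides note that
       exponential backoff is a bad choice for this delay)          */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);
}
```

FIFO order falls out directly: tickets are handed out in fetch&incr order, and each release serves exactly the next ticket.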
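The MCS list-based queueing lock can likewise be sketched with C11 atomics: atomic_exchange plays the role of fetch&store when enqueueing, and atomic_compare_exchange_strong plays the role of compare&swap when releasing. Again, this is an illustrative sketch rather than the lecture's code; the names are invented, each waiter spins only on its own locally-allocated flag, and a real implementation would also tune memory orderings and per-processor node allocation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative MCS queue lock sketch (names are not from the lecture). */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;              /* each waiter spins on its own flag */
} mcs_node_t;

typedef struct {
    mcs_node_t *_Atomic tail;        /* last node in the waiting list */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, (mcs_node_t *)NULL);
    /* fetch&store: swap ourselves in as the new tail of the list */
    mcs_node_t *pred = atomic_exchange(&l->tail, me);
    if (pred != NULL) {                   /* lock held: queue behind pred */
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me);    /* link into predecessor's node */
        while (atomic_load(&me->locked))
            ;                             /* spin on our local flag only */
    }
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* compare&swap: if we are still the tail, the list empties */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected,
                                           (mcs_node_t *)NULL))
            return;
        /* a successor swapped the tail but has not linked in yet */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);   /* hand the lock on, in FIFO order */
}
```

The O(1)-traffic property comes from each processor spinning on its own node's flag, so a release touches only the successor's location.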