Berkeley COMPSCI 258 - Hardware-Software Trade-offs in Synchronization and Data Layout - D2716492

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI 258> Hardware-Software Trade-offs in Synchronization and Data Layout

Berkeley COMPSCI 258 - Hardware-Software Trade-offs in Synchronization and Data Layout

School name University of California, Berkeley

Course Compsci 258- Parallel Processors

Pages 5

Download Save

Unformatted text preview:

CS258 S991NOW Handout Page 1Hardware-Software Trade-offs inSynchronization and Data LayoutCS 258, Spring 99David E. CullerComputer Science DivisionU.C. Berkeley2/17/99 CS258 S99 2Role of Synchronization• “A parallel computer is a collection ofprocessing elements that cooperate andcommunicate to solve large problems fast.”• Types of Synchronization– Mutual Exclusion– Event synchronization» point-to-point» group» global (barriers)• How much hardware support?– high-level operations?– atomic instructions?– specialized interconnect?2/17/99 CS258 S99 3Mini-Instruction Set debate• atomic read-modify-write instructions– IBM 370: included atomic compare&swap formultiprogramming– x86: any instruction can be prefixed with a lock modifier– High-level language advocates want hardware locks/barriers» but it’s goes against the “RISC” flow,and has otherproblems– SPARC: atomic register-memory ops (swap, compare&swap)– MIPS, IBM Power: no atomic operations but pair ofinstructions» load-locked, store-conditional» later used by PowerPC and DEC Alpha too• Rich set of tradeoffs2/17/99 CS258 S99 4Other forms of hardware support• Separate lock lines on the bus• Lock locations in memory• Lock registers (Cray Xmp)• Hardware full/empty bits (Tera)• Bus support for interrupt dispatch2/17/99 CS258 S99 5Components of a Synchronization Event• Acquire method– Acquire right to the synch» enter critical section, go past event• Waiting algorithm– Wait for synch to become available when it isn’t– busy-waiting, blocking, or hybrid• Release method– Enable other processors to acquire right to the synch• Waiting algorithm is independent of type ofsynchronization– makes no sense to put in hardware2/17/99 CS258 S99 6Strawman Locklock: ld register, location /* copy location to register */cmp location, #0 /* compare with 0 */bnz lock /* if not 0, try again */st location, #1 /* store 1 to mark it locked */ret /* return control to caller */unlock: st location, #0 /* write 0 to location */ret /* return control to caller */Busy-WaitWhy doesn’t the acquire method work?Release method?CS258 S992NOW Handout Page 22/17/99 CS258 S99 7Atomic Instructions• Specifies a location, register, & atomic operation– Value in location read into a register– Another value (function of value read or not) stored intolocation• Many variants– Varying degrees of flexibility in second part• Simple example: test&set– Value in location read into a specified register– Constant 1 stored into location– Successful if value loaded into register is 0– Other constants could be used instead of 1 and 02/17/99 CS258 S99 8Simple Test&Set Locklock: t&s register, location bnz lock /* if not 0, try again */ret /* return control to caller */unlock: st location, #0 /* write 0 to location */ret /* return control to caller */• Other read-modify-write primitives– Swap– Fetch&op– Compare&swap»Three operands: location, register to compare with,register to swap with»Not commonly supported by RISC instruction sets• cacheable or uncacheable2/17/99 CS258 S99 9Performance Criteria for Synch. Ops• Latency (time per op)– especially when light contention• Bandwidth (ops per sec)– especially under high contention• Traffic– load on critical resources– especially on failures under contention• Storage• Fairness2/17/99 CS258 S99 10ssssssssssssssssllllllllllllllllnn nnnnnnnnnnnnnnuuuuuuuuuuuuuuuuNumber of processorsTime (µs)11 13 1502468101214161820sTest&set, c = 0lTest&set, exponential backoff, c = 3.64nTest&set, exponential backoff, c = 0uIdeal9753T&S Lock Microbenchmark: SGI Chal.lock;delay(c);unlock;• Why does performance degrade?• Bus Transactions on T&S?• Hardware support in CC protocol?2/17/99 CS258 S99 11Enhancements to Simple Lock• Reduce frequency of issuing test&sets whilewaiting– Test&set lock with backoff– Don’t back off too much or will be backed off when lockbecomes free– Exponential backoff works quite well empirically: ith time =k*ci• Busy-wait with read operations rather thantest&set– Test-and-test&set lock– Keep testing with ordinary load» cached lock variable will be invalidated when releaseoccurs– When value changes (to 0), try to obtain lock with test&set» only one attemptor will succeed; others will fail and starttesting again2/17/99 CS258 S99 12Improved Hardware Primitives: LL-SC• Goals:– Test with reads– Failed read-modify-write attempts don’t generate invalidations– Nice if single primitive can implement range of r-m-w operations• Load-Locked (or -linked), Store-Conditional– LL reads variable into register– Follow with arbitrary instructions to manipulate its value– SC tries to store back to location– succeed if and only if no other write to the variable since thisprocessor’s LL» indicated by condition codes;• If SC succeeds, all three steps happened atomically• If fails, doesn’t write or generate invalidations– must retry aquireCS258 S993NOW Handout Page 32/17/99 CS258 S99 13Simple Lock with LL-SClock: ll reg1, location /* LL location to reg1 */sc location, reg2 /* SC reg2 into location*/beqz reg2, lock /* if failed, start again */retunlock: st location, #0 /* write 0 to location */ret• Can do more fancy atomic ops by changing what’sbetween LL & SC– But keep it small so SC likely to succeed– Don’t include instructions that would need to be undone (e.g. stores)• SC can fail (without putting transaction on bus) if:– Detects intervening write even before trying to get bus– Tries to get bus but another processor’s SC gets bus first• LL, SC are not lock, unlock respectively– Only guarantee no conflicting write to lock variable between them– But can use directly to implement simple operations on shared variables2/17/99 CS258 S99 14Trade-offs So Far• Latency?• Bandwidth?• Traffic?• Storage?• Fairness?• What happens when several processorsspinning on lock and it is released?– traffic per P lock operations?2/17/99 CS258 S99 15Ticket Lock• Only one r-m-w per acquire• Two counters per lock (next_ticket, now_serving)– Acquire: fetch&inc next_ticket; wait for now_serving == next_ticket» atomic op when arrive at lock, not when it’s free (so lesscontention)– Release: increment now-serving• Performance– low latency for low-contention - if fetch&inc cacheable– O(p) read misses at release, since all spin on same variable–

View Full Document


School:
Email:
New Password:
Confirm Password:

Berkeley COMPSCI 258 - Hardware-Software Trade-offs in Synchronization and Data Layout

Sign up for free to view:

Please select your school